Performance evaluation of computer and
software  systems  has  become  a  rapidly  growing 
field  with  a  growing  number  of  tools  being 
developed  for  analyzing  various  performance 
aspects  of  such  systems.  The  literature  on 
performance  evaluation  methodologies  has  also 
mushroomed  with  various  proposals  from 
researchers  all  over  the  world.  This  paper 
presents  the  results  of  a  survey  conducted  on 
automated  performance  analysis  tools  for 
computer  systems.  The  paper  also  surveys  some 
of  the  evaluation  methodologies  proposed  by 
various  authors.  The  survey  includes 
measurement  based  tools,  analytical  tools, 
simulation  tools  and  visualization  tools,  and 
describes  their  properties  and  capabilities.  The 
tools  have  been  categorized  based  on  their 
analysis  capabilities  and  include  system- 
oriented,  process-oriented  and  module-oriented 
categories.  The  tools  surveyed  in  this  paper 
incorporate  various  techniques  including 
simulation, modeling (Petri net, queuing,
semi-Markov, etc.), measurement, visualization and
emulation. The vast number of tools available to
developers  of  computer  systems  makes  selection 
of  the  appropriate  tool  an  increasingly  difficult 
task.  This  paper  presents  a  new  methodology  for 
tool  classification  which  the  authors  hope  will 
capture  the  characteristics  of  individual  tools 
better  than  the  standard  table  format.  This  model 
will  also  make  tool  search  based  on  specific 
characteristics  easier  for  the  designers  who  need 
the  appropriate  tools. 


Keywords : computer aided performance
engineering,  performance,  (analytical, 
simulation,  measurement,  visualization)  tools. 


1  INTRODUCTION  : 

The  number  of  tools  that  have  been  developed 
for  computer  system  performance  analysis  is 
overwhelming.  The  underlying  analysis 
methodology,  system  requirements  (hardware 
and  software),  analysis  capabilities  and  front 
ends make each tool unique. This paper surveys
some of the tools available in both the industry
and the research community and discusses some
analysis  methodologies  proposed  in  the 
literature. 

Performance of a computer system can
be estimated using one of three broad
approaches, namely : analytically, through
simulation, or by measurement. The analytical
approach focuses on building a static model of
the system under study. The resulting model,
which describes the behavior of the system, can then
be solved to obtain performance estimates. The
input  parameters  to  the  model  affect  the  estimates 
obtained.  It  is  important  to  note  that  this 
approach  focuses  on  solving  a  static  model  of 
the  underlying  system.  In  contrast  to  this  are  the 
simulation  and  measurement  based  approaches. 
These  approaches  focus  on  the  dynamic  behavior 
of  the  system.  Simulation  tools  require  that  the 
user  write  the  program  in  a  high  level  simulation 
language  and  provide  system  characteristics  to 
the  tool.  Simulation  tools  provide  developers 
with  a  virtual  machine  on  which  to  execute  their 
code  and  get  performance  estimates.  Thus 
estimates for a target architecture can be obtained
by simulating the code on the development
machine.  Measurement  tools  do  not  require  that  a 
model  be  generated  as  the  performance  data  is 
obtained  directly  from  the  underlying  system. 
This  requires  instrumentation  of  the  code  at 
various  levels.  The  techniques  outlined  above 
have  their  advantages  and  disadvantages. 
Analytical  modeling  becomes  increasingly 
intractable  as  the  complexity  of  the  system 
increases.  The  demands  made  by  simulation  tools 
on  processor  time  and  memory  increase  as  the 
complexity  of  the  model  increases.  Measurement 
tools  must  be  as  non-intrusive  as  possible  so  as 
not  to  influence  the  performance  data  obtained. 
The difficulties with these techniques have led some
researchers  to  move  towards  hybrid  models 
involving  the  three  techniques.  The  authors 
would  like  to  point  out  that  most  of  the  tools 
surveyed  in  the  paper  use  the  simulation  or 
measurement  approaches. 

In  addition  to  categorizing  the  tools  as 
analytical,  simulation,  or  measurement,  it  is  also 
possible  to  describe  the  tool  in  terms  of  its 
analysis  capabilities.  Several  of  the  tools 
surveyed  in  this  paper  exhibit  analysis 
hierarchies.  A  computer  system  can  be  analyzed 
at  different  levels.  These  levels  are  incorporated 
into a hierarchy that has the system as a single
entity  at  the  highest  level  and  individual 
processes  (user  programs)  at  the  lowest. 
Individual  processes  are  considered  to  be 
composed  of  a  number  of  modules  and  analysis 
at  the  module  level  focuses  specifically  on  the 
performance  of  a  single  module. 

For the purposes of this paper, we
classify the hierarchy into three levels, namely
system level, process level and module level. A
brief description of the terminology used is
presented below :

•  System  level  :  This  represents  the  highest 
level  of  abstraction,  and  includes  all 
components  of  a  computer  system  including 
hardware  interfaces  to  external  sources. 
Analysis  at  this  level  is  of  large  granularity 
and  focuses  primarily  on  the  performance  of 
the  system  in  terms  of  throughput,  waiting 
time  etc. 

•  Process  level  :  Performance  Analysis  at 
this  level  focuses  on  the  process  itself  and  in 
the  case  of  parallel  and  distributed  systems, 
interaction  among  the  processes  constituting 
such  a  system.  A  number  of  process 
hierarchies  can  exist  depending  on  the 
complexity of the software system. This
level can also be called the job level,
where a job comprises one or more
processes. The operating system is also
treated  as  a  job  on  the  system  and  can  be 
analyzed  at  this  level. 

•  Module  level :  Analysis  at  the  module  level, 
the  lowest  level  in  the  hierarchy,  is  focused 
primarily  on  the  modules  (procedures  and/or 
functions)  of  the  individual  processes. 

The  tools  surveyed  in  this  paper  can  be 
categorized  in  terms  of  these  hierarchy  levels. 
Many  of  the  tools  surveyed  exhibit  capabilities 
that  would  put  them  in  more  than  one  category  of 
analysis. 

2  SYSTEM  LEVEL  ANALYSIS  : 

2.1  Simulation  Techniques: 

Introduction  to  simulation  : 

A  computer  system’s  performance  can 
be  evaluated  by  various  simulation  techniques 
such  as  emulation,  Monte  Carlo,  trace-driven 
and  discrete  event  simulation.  The  simulation 
approach  can  be  used  to  analyze  complex 
systems  which  are  difficult  to  measure  and  model 
using  analytical  techniques.  An  overview  of  the 
various  simulation  techniques  introduced  above 
follows : 

Monte  Carlo  Simulation : 

A  static  simulation  or  one  without  a 
time  axis  is  called  a  Monte  Carlo  simulation. 
Such simulations are used to model probabilistic
phenomena that do not change characteristics
with  time.  These  simulations  require  the 
generation  of  pseudo-random  numbers.  Monte 
Carlo  simulations  are  also  used  for  evaluating 
non-probabilistic  expressions  using  probabilistic 
methods. 
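
As a minimal sketch of the last point, the following Python fragment
evaluates a non-probabilistic quantity (the constant pi) by a
probabilistic method. The function name estimate_pi and its parameters
are illustrative assumptions and are not taken from any tool surveyed
in this paper.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square.

    There is no time axis: each sample is independent of the others,
    which is what makes this a static (Monte Carlo) simulation.
    """
    rng = random.Random(seed)  # seeded pseudo-random number generator
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point lies inside the quarter circle
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

The accuracy of such an estimate improves only with the square root of
the number of samples, so the sample count is the main cost parameter.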

Trace-Driven  Simulation  : 

A  simulation  using  a  trace  as  its  input  is 
a trace-driven simulation. A trace is a
time-ordered record of events on a real system.
Trace-driven simulations are quite common in
computer  system  analysis.  They  are  generally 
used  in  analyzing  or  tuning  resource 
management  algorithms.  Trace-driven  simulation 
has  been  used  to  analyze  various  algorithms 
including  paging  algorithms,  cache  analysis, 
CPU  scheduling  algorithms,  deadlock  prevention 
algorithms,  and  algorithms  for  dynamic 
allocation  of  storage.  A  trace  of  the  resource 
demand  is  used  as  an  input  to  the  simulation, 
which models different algorithms. For example,
in order to compare different memory
management  schemes,  a  trace  of  page  reference 
patterns  of  key  programs  can  be  obtained  on  a 
system.  This  trace  can  then  be  used  to  find  the 
optimal  set  of  parameters  for  a  given  memory 
management  algorithm  or  to  compare  different 
algorithms. 
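
The comparison described above can be sketched in Python as follows.
The page reference trace is a small hand-made example, not a trace
measured on a real system, and the function names fifo_faults and
lru_faults are hypothetical. The sketch replays the same trace through
two page replacement policies and counts page faults.

```python
from collections import OrderedDict, deque

# Hand-made page reference trace (an assumption, not measured data).
EXAMPLE_TRACE = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]


def fifo_faults(trace, frames):
    """Count page faults under first-in first-out replacement."""
    resident = set()
    load_order = deque()  # pages in the order they were loaded
    faults = 0
    for page in trace:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.remove(load_order.popleft())  # evict the oldest page
        resident.add(page)
        load_order.append(page)
    return faults


def lru_faults(trace, frames):
    """Count page faults under least-recently-used replacement."""
    resident = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)  # evict the least recently used page
        resident[page] = True
    return faults


if __name__ == "__main__":
    for frames in (3, 4):
        print(frames, fifo_faults(EXAMPLE_TRACE, frames),
              lru_faults(EXAMPLE_TRACE, frames))
```

Because both policies see exactly the same input, differences in the
fault counts are due to the algorithm alone. On this particular trace
FIFO incurs more faults with four frames than with three, while LRU
does not.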

Discrete-Event  Simulations  : 

A  discrete-event  model  represents  a 
process  in  which  the  system  state  changes  in 
distinct  steps.  These  state  changes  are  usually 
characterized  by  the  passage  of  time.  Systems 
that  can  be  described  by  discrete-event  models 
are those in which resource contention and
allocation occur. Queuing and probabilistic
behavior  are  important  phenomena  encompassed 
by  discrete-event  models.  Computer  systems 
exhibit  such  behavior  and  are  excellent  subjects 
for  discrete-event  simulation. 

All  discrete-event  simulations  have  a 
common  structure  regardless  of  the  system  being 
modeled.  If  a  general-purpose  language  is  used, 
all  the  components  have  to  be  developed  by  the 
analyst.  A  simulation  language  provides  some  of 
the  components  and  leaves  others  for  the  analyst 
to  develop.  Common  components  provided  by 
such  languages  include  an  event  scheduler, 
simulation  clock  and  a  time-advancing 
mechanism,  system  state  variables,  event 
routines,  input  routines,  report  generator, 
initialization  routines,  trace  routines,  dynamic 
memory  management  and  the  main  program. 

There  are  three  approaches  to 
developing  a  discrete-event  simulation  :  the 
event-oriented  approach,  the  process-oriented 
approach,  and  the  activity-oriented  approach.  For 
the  event-oriented  approach,  the  model  is 
described  by  a  series  of  events  between  which 
simulated  time  may  elapse.  An  event  usually 
changes  the  state  of  the  system.  Using  the 
process-oriented  approach,  the  model  is 
described  by  a  number  of  interacting  processes 
which  can  represent  either  independent 
procedures,  where  a  procedure  is  a  sequence  of 
activities  (sometimes  referred  to  as  the 
transaction-oriented  view)  or  resources 
(sometimes  referred  to  as  the  resource-oriented 
view).  A  simulation  using  the  activity-oriented 
approach is defined by a number of activities
which  are  executed  when  certain  conditions  are 
met.  Simulation  time  advances  in  increments,  and 
at  each  advance  the  activity  list  is  checked.  All 
activities  scheduled  to  execute  at  a  particular 
time  are  executed. 
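
A minimal sketch of the event-oriented approach, written in a
general-purpose language (Python), is given below. It assumes a single
server with a first-come first-served queue and exponentially
distributed interarrival and service times. The function name
simulate_single_server and all of its parameters are illustrative
assumptions. The event list, simulation clock, time-advancing mechanism
and state variables correspond to the common components listed earlier.

```python
import heapq
import random
from collections import deque


def simulate_single_server(arrival_rate, service_rate, num_customers, seed=0):
    """Return the mean waiting time (before service) of the first
    num_customers customers to enter service."""
    rng = random.Random(seed)
    event_list = []        # future events as (time, sequence number, kind)
    sequence = 0           # tie-breaker for events with equal time stamps
    waiting = deque()      # arrival times of customers waiting in the queue
    server_busy = False    # system state variable
    started = 0            # customers that have entered service
    total_wait = 0.0

    def schedule(time, kind):
        nonlocal sequence
        heapq.heappush(event_list, (time, sequence, kind))
        sequence += 1

    schedule(rng.expovariate(arrival_rate), "arrival")
    while started < num_customers:
        # Time-advancing mechanism: jump to the next scheduled event.
        clock, _, kind = heapq.heappop(event_list)
        if kind == "arrival":
            schedule(clock + rng.expovariate(arrival_rate), "arrival")
            if server_busy:
                waiting.append(clock)
            else:
                server_busy = True
                started += 1  # served immediately, so the wait is zero
                schedule(clock + rng.expovariate(service_rate), "departure")
        else:  # departure event routine
            if waiting:
                total_wait += clock - waiting.popleft()
                started += 1
                schedule(clock + rng.expovariate(service_rate), "departure")
            else:
                server_busy = False
    return total_wait / started


if __name__ == "__main__":
    print(simulate_single_server(0.5, 1.0, 100_000))
```

For this particular model the result can be checked against queuing
theory: with arrival rate 0.5 and service rate 1.0 the mean waiting
time of an M/M/1 queue is 1.0, so a long run should produce a value
close to that.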


Simulation  allows  the  user  to  model 
large  complex  systems  and  hence  is  a  popular 
choice  for  system  level  modeling.  Selecting  a 
proper  language  is  probably  the  most  important 
step  in  the  process  of  developing  a  simulation 
model.  An  incorrect  decision  during  this  step 
may  lead  to  long  development  times,  incomplete 
studies,  and  failures.  There  are  four  choices  :  a 
simulation  language,  a  general-purpose 
language,  extension  of  a  general-purpose 
language,  and  a  simulation  package. 

Simulation  languages  such  as 
SIMULA[33]  and  SIMSCRIPT[191]  have  built- 
in  facilities  for  time  advancing,  event  scheduling, 
entity  manipulation,  random  variate  generation, 
statistical  data  collection,  and  report  generation. 
These  languages  allow  the  analyst  to  spend  more 
time  on  issues  specific  to  the  system  being 
modeled  rather  than  worry  about  issues  that  are 
general  to  all  simulations. 

A  general-purpose  language  such  as  C 
or  FORTRAN  is  chosen  for  simulation  purposes 
primarily  because  of  the  analyst’s  familiarity  with 
the  language.  It  may  also  be  that  deadline 
requirements  do  not  allow  time  for  him  or  her  to 
learn  a  new  simulation  language. 

An  extension  of  a  general-purpose 
language  such  as  GASP  (for  FORTRAN)  is 
another  alternative.  These  extensions  consist  of  a 
collection  of  routines  to  handle  tasks  that  are 
commonly  required  in  simulations  with  the  aim 
of  providing  a  compromise  in  terms  of 
efficiency,  flexibility,  and  portability. 

Simulation  packages  such  as  QNET4 
and  RESQ[127]  allow  the  user  to  define  a  model 
using  a  dialog.  The  packages  have  a  library  of 
data  structures,  routines,  and  algorithms. 

Simulation  languages  can  be  classified 
into  two  categories,  continuous  simulation 
languages  and  discrete-event  simulation 
languages,  based  on  the  types  of  events  they 
simulate.  Continuous  simulation  languages  are 
designed  to  handle  continuous-event  models  that 
are  described  by  differential  equations.  Discrete- 
event  simulation  languages  such  as  SIMULA, 
GPSS,  SIMSCRIPT  and  GASP  are  designed  to 
handle  discrete-state  changes. 

An  Overview  of  Some  Simulation  Languages  : 

This  section  provides  the  reader  with  a 
brief  summary  of  some  of  the  popular  simulation 
languages  and  tools  being  used  in  industry  and 
academia today. Simulation languages fall
into one of two broad categories :
flow-oriented languages and statement-oriented
languages. Statement-oriented simulation
languages  closely  resemble  general  purpose 
programming  languages  such  as  C  or 
FORTRAN.  Flow-oriented  languages  provide 
flowchart-like  symbols  which  can  be  used  to 
construct  graphs  representing  system  behavior. 
There also exist toolkits that serve to extend the
original  language  with  simulation  capabilities. 

SMPL[125]  is  a  general  purpose 
discrete  event  simulation  library  written  in  C. 
SMPL  is  portable  and  uses  an  event  oriented 
approach. 

YACSIM[105]  is  a  process-oriented 
discrete-event  simulator  implemented  as  an 
extension  of  the  C  programming  language. 

SimPack[75]  is  a  collection  of  C  and 
C++  libraries  and  executable  programs  for 
computer  simulation.  Different  simulation 
algorithms  are  supported  including  discrete-event 
simulation, continuous simulation and
multimodel (combined) simulation. SimPack provides
the  analyst  with  a  set  of  basic  utilities  that  can  be 
built  upon  to  construct  special  purpose 
simulation  languages.  SimPack’s  discrete-event 
simulation  is  an  event-oriented  approach  at  the 
basic  level.  SimPack  also  provides  a  basic  X 
Windows  based  graphical  user  interface. 

SIMULA[33]  is  a  general  purpose 
language in the style of ALGOL. The language
supports object oriented features including
encapsulation, inheritance, and polymorphism.
SIMULATION, a system class of SIMULA, is a
process  oriented  language  supporting  discrete- 
event  simulation. 

CSIM[184]  is  a  process-oriented, 
general  purpose  simulation  toolkit  written  with  C 
language  functions.  The  toolkit  has  been  used  by 
programmers  to  create  and  implement  process- 
oriented,  discrete-event  simulation  models  of 
computer  systems,  and  software  systems 
including  applications  executing  on 
multiprocessor  systems. 

SIM++[224]  is  a  general  purpose 
simulation  language  based  upon  C++,  that 
permits  writing  process-oriented  discrete-event 
simulation  models.  SIM++  is  currently  available 
for PC/AT running Zortech C++ 2.0 or later, and
for DECstation 3100/5000 running ULTRIX 4.2
and the AT&T CC.

SIMAN[160] features simulation and
analysis of discrete (process or event oriented)
and continuous systems (algebraic, difference or
differential equations). It is a flow oriented
language with the system being represented as
linear top-down flow-graphs which depict the
flow of entities through the system.

SIMSCRIPT II.5[191] is a general
purpose  process  oriented  programming  language 
with  structured  programming  constructs.  It 
provides  advanced  GUI  features  to  analysts 
including  pull-down  menus,  push  buttons, 
scrolling  text  windows  and  dynamic  graphs  and 
meters. 

MODSIM II[27] is a high level
Modula-2 based object-oriented language with multiple
inheritance,  message  passing,  dynamic  object 
creation,  dynamic  method  redefinition  and 
separate  compilation  of  modules.  The  compiler 
compiles  the  source  code  to  C.  The  programming 
environment  includes  a  symbolic  debugger,  a 
genealogy  browser  with  a  cross-referencer,  a  file 
manager,  and  a  compilation  manager. 

SLAM  II[153]  is  a  simulation  language 
that  combines  process,  event  and  continuous 
views  on  a  model.  A  simulation  begins  with  a 
network  model  or  flow  diagram  showing  the  flow 
of  entities.  A  SLAM  II  network  is  made  up  of 
nodes  at  which  processing  is  performed. 
Common  functions  are  entering  and  leaving  the 
system,  reserving  resources,  starting  and  stopping 
flows  etc.  Animations  can  be  created  by  first 
designing  the  scene  setup  and  then  writing  a 
script.  Scripts  may  be  written  using  a  forms 
based  system.  The  script  specifies  which 
animation  sequence  should  occur  when  a 
simulation  event  happens. 

HOCUS[163]  is  a  simulation  package 
supporting  discrete-event  simulation  modeling 
using  the  activity-oriented  approach.  Activity 
scanning  centers  on  the  definition  of  activities  in 
a  model.  Entities  are  assumed  to  flow  through  the 
model  waiting  for  other  entities  before  engaging 
in  activities  in  a  certain  order. 

DEMOS[163] is a process-oriented
discrete-event simulation implementation on the
SIMULA language. DEMOS has a graphical
description language and versions exist under
MS-DOS and X Windows, both written in
SIMULA. A front end called Demographer
supports hierarchical modeling with
subprocesses, based on extended activity diagrams.

2.2  Modeling  Tools  :

Modeling  tools  provide  the  analyst  with 
user  friendly  interfaces  and  ease  the  burden  of 
program  development.  These  tools  can  be  used 
by  the  analyst  at  the  cost  of  flexibility  as  these 
tools make assumptions about the type of system
being modeled. Modeling techniques can be
analytic,  numerical,  or  simulation  based. 
Analytical  techniques  provide  general  models 
which  may  be  solved  symbolically  for  the  steady 
state  measures  of  that  system  and  can  be  used  to 
efficiently  explore  ranges  of  parameters. 
Unfortunately, only a very restricted set of
models has such solutions. Even fewer have
exact solutions, leading to the necessity of
finding approximation techniques. Analytical
techniques are usually applied to queuing
networks, where the structure of the network
allows good rules for finding appropriate models,
notably those in the class known as BCMP
networks.

Somewhere  between  analytical  and 
simulation  models,  it  is  possible  to  use  numerical 
techniques,  where  the  steady  state  behavior  of  a 
system  is  found  without  detailed  simulation,  but 
only  in  terms  of  a  given  set  of  parameters. 
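
One way such a numerical technique can work is sketched below in
Python. The sketch assumes that the system has already been reduced to
a small, irreducible continuous-time Markov chain with a generator
matrix Q and at least one nonzero rate. It finds the steady state
distribution by repeated multiplication with a uniformized transition
matrix. The function name steady_state and the choice of power
iteration are illustrative assumptions. Tools of this kind typically
use more sophisticated solvers.

```python
def steady_state(generator, tolerance=1e-12, max_iterations=100_000):
    """Approximate the steady state vector pi with pi * Q = 0, sum(pi) = 1.

    generator is a square list of lists in which each row sums to zero.
    """
    n = len(generator)
    # Uniformization rate, chosen strictly larger than every exit rate so
    # that the resulting discrete-time chain is aperiodic.
    rate = 1.1 * max(-generator[i][i] for i in range(n))
    transition = [
        [(1.0 if i == j else 0.0) + generator[i][j] / rate for j in range(n)]
        for i in range(n)
    ]
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iterations):
        nxt = [sum(pi[i] * transition[i][j] for i in range(n))
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tolerance:
            return nxt
        pi = nxt
    return pi


if __name__ == "__main__":
    # Two-state example: rate 2.0 from state 0 to 1, rate 1.0 back.
    print(steady_state([[-2.0, 2.0], [1.0, -1.0]]))
```

The result is valid only for the given rates. Exploring another
parameter setting requires solving the chain again, which is the
limitation noted above.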

The system level modeling tools
discussed in this paper fall into three categories :
queuing network based, Petri net based and tools
that use other techniques to model a system.

Queuing  network  based  tools  : 

QNAP[212] uses a high level textual
language for the description of models, which are
then compiled into a form solvable by a range of
solvers. Solvers include exact solvers for
BCMP networks, numerical solvers for less
restrictive models, a Markovian solver for
reasonably sized models preserving Markovian
assumptions and a simulation solver for any
model describable in the QNAP language. The
QNAP  language  is  structured  around  entities 
called  service  centers  which  could  be  simple 
server  nodes  as  in  queuing  networks  or  may  have 
more  complex  behavior,  described  in  an 
algorithmic  language.  Options  for  tracing  and 
debugging  simulations  are  also  supported  in 
current  versions  of  QNAP. 

QNAP  is  now  a  part  of  Simulog’s 
MODLINE[164]  modeling  tool  incorporating 
several  features  developed  during  the  ESPRIT  II 
IMSE  project.  One  of  the  features  is  a  graphical 
user  interface  allowing  models  to  be  built  from  a 
menu of symbols as a queuing network. Nodes
can be parameterized to define a complete model
and service center nodes can have QNAP code
associated with them to provide general
descriptions of behavior. A second feature is an
experiment description facility which allows
analysts to describe repeated runs of a model
with varying parameters and collects outputs
from each in a systematic manner. MODLINE
provides  features  for  animation  of  simulation 
execution  and  selective  instrumentation  of 
models. 

MACOM[114]  was  developed  to 
support  Markovian  Analysis  of  COMmunication 
systems.  The  model  view  consists  of  sources, 
sinks  and  service  and  control  elements.  MACOM 
allows  desired  measures  and  derived  statistics 
and  input  parameters  (including  experiment 
series)  to  be  defined  by  so  called  evaluation 
descriptions.  A  GUI  is  used  to  construct  the 
models  and  evaluation  descriptions.  MACOM 
solves  models  by  numerical  evaluation  of  the 
Markov chain. MACOM runs on SUN
workstations  and  its  graphical  interface  is 
supported  under  SunView  and  X  Windows. 

The  Computer-Aided  Performance  and 
Reliability  Evaluation  System  (CAPRES)[108]  is 
a  queuing  network-based  tool  for  evaluating  the 
design  of  large-scale,  parallel,  fault-tolerant 
computer  architectures.  CAPRES  uses  the 
analytic  approach  for  estimating  performance, 
and  can  apply  one  of  the  following  techniques  : 
modified mean value analysis, flow-equivalent
aggregation via Chandy et al.'s theorem[21], or
a modified linearizer algorithm. CAPRES can
predict the performance, reliability, and
performability of a system in terms of response
times, queue lengths, nodal and system throughput, and
component utilization.

The  Graphical  Input  Simulation  Tool 
(GIST)[192]  is  a  transaction-oriented,  discrete- 
event  simulation  tool  for  developing  extended 
queuing  network  models.  GIST  models  are 
passed  through  a  translator  which  generates 
source  code  for  a  simulation  compiler. 
Performance  statistics  for  the  model  are  collected 
including  queue  length,  queue  waiting  time,  and 
number of waiting jobs.

The  Performance  Analysis  Workstation 
(PAW)  [133]  is  a  graphical  tool  that  supports  the 
development  of  queuing  network  models.  Models 
are  defined  in  terms  of  nodes  and  uni-directional 
links. PAW provides the analyst with three tools :
a graphics editor, a text editor, and a simulator.
The  graphics  editor  allows  the  user  to  specify  the 
network  topology  as  a  diagram.  Parameters 
associated  with  each  node  are  entered  via  the  text 
editor  through  the  use  of  forms.  The  simulator 
allows  two  modes  of  execution  :  continuous  or 
step.  The  simulator  also  allows  model  execution 
to  be  traced  and  features  the  facility  to  provide 
periodic  snapshots. 


The  Research  Queuing  Package 
(RESQ)[127]  is  a  modeling  tool  that  supports  the 
development  and  analysis  of  extended  queuing 
network  models.  A  model  consists  of  nodes, 
queues,  jobs,  routing  rules,  and  routing  chains 
and  is  specified  through  the  use  of  a  RESQ 
language.  Models  can  be  solved  either 
analytically using the Mean Value Analysis
algorithm,  or  through  simulation.  RESQ  can 
determine  resource  utilization,  throughput,  mean 
queue  lengths,  mean  queue  time,  queue  length 
distributions,  queue  time  distributions,  and 
statistical  analysis  of  tokens. 
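
For reference, the textbook form of exact Mean Value Analysis for a
closed, single-class network of fixed-rate queuing stations can be
sketched in Python as follows. This is not RESQ's implementation. The
function name mean_value_analysis and its parameters are illustrative
assumptions, and extended queuing networks of the kind RESQ supports
generally need more than this basic recursion.

```python
def mean_value_analysis(service_demands, num_jobs, think_time=0.0):
    """Exact MVA for a closed single-class queuing network.

    service_demands[k] is the total service demand of one job at station k.
    Returns (system throughput, list of mean queue lengths per station).
    """
    queue_lengths = [0.0] * len(service_demands)
    throughput = 0.0
    for population in range(1, num_jobs + 1):
        # Arrival theorem: an arriving job sees the mean queue lengths of
        # the same network with one job fewer.
        residence_times = [
            demand * (1.0 + queue)
            for demand, queue in zip(service_demands, queue_lengths)
        ]
        throughput = population / (think_time + sum(residence_times))
        # Little's law applied at each station.
        queue_lengths = [throughput * r for r in residence_times]
    return throughput, queue_lengths


if __name__ == "__main__":
    print(mean_value_analysis([0.2, 0.1], num_jobs=5))
```

The recursion builds the solution for the desired population from the
solutions for all smaller populations, so its cost grows linearly with
the number of jobs and the number of stations.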

Petri  Net  based  tools  : 

GreatSPN[46]  has  evolved  from  a  fairly 
simple  tool  for  graphical  construction  and 
numerical  solution  of  GSPNs.  Model 
construction  is  supported  by  placing  and  linking 
icons  from  a  menu,  representing  places, 
transitions  and  arcs.  The  resulting  net  may  be 
analyzed  for  structural  and  behavioral  properties, 
such  as  deadlocks  and  invariants.  The  model  is 
also  solved  by  numerical  techniques  based  on 
generating  the  underlying  Markov  chain. 
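
The behavioral analysis mentioned above can be illustrated with a
small Python sketch for untimed place/transition nets. It enumerates
the reachable markings and reports those in which no transition is
enabled (deadlocks). The sketch does not reproduce GreatSPN's
algorithms. The data layout and the function names enabled, fire and
find_deadlocks are assumptions made for illustration, and the initial
marking is assumed to list every place of the net.

```python
from collections import deque


def enabled(marking, transition):
    """A transition is enabled if every input place holds enough tokens."""
    inputs, _ = transition
    return all(marking.get(place, 0) >= count
               for place, count in inputs.items())


def fire(marking, transition):
    """Return the marking that results from firing an enabled transition."""
    inputs, outputs = transition
    result = dict(marking)
    for place, count in inputs.items():
        result[place] -= count
    for place, count in outputs.items():
        result[place] = result.get(place, 0) + count
    return result


def find_deadlocks(initial_marking, transitions, max_markings=10_000):
    """Breadth-first search of the reachability set for dead markings."""
    start = tuple(sorted(initial_marking.items()))
    seen = {start}
    frontier = deque([start])
    deadlocks = []
    while frontier and len(seen) <= max_markings:
        marking = dict(frontier.popleft())
        successors = [fire(marking, t) for t in transitions.values()
                      if enabled(marking, t)]
        if not successors:
            deadlocks.append(marking)
        for successor in successors:
            key = tuple(sorted(successor.items()))
            if key not in seen:
                seen.add(key)
                frontier.append(key)
    return deadlocks


if __name__ == "__main__":
    # Each transition is a pair (input arcs, output arcs).
    cyclic_net = {"start": ({"idle": 1}, {"busy": 1}),
                  "stop": ({"busy": 1}, {"idle": 1})}
    print(find_deadlocks({"idle": 1, "busy": 0}, cyclic_net))
```

The bound max_markings guards against nets whose reachability set is
very large or infinite, which is the main practical limit of this kind
of exhaustive analysis.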

DSPN[122]  Express  is  offered  by  the 
Technical  University  of  Berlin.  DSPN  Express 
allows  numerical  solution  of  models 
incorporating  deterministic  time  delays  in 
transitions.  DSPN  Express  does  not  support 
simulation. 

QPN[25]  is  a  Petri  net  modeling  tool 
from  the  University  of  Dortmund.  The  tool  is 
similar  in  its  general  appearance  to  GreatSPN.  It 
supports  timed  places  as  well  as  timed 
transitions.  A  timed  place  corresponds  to  a 
service  station  of  a  queuing  network  for  which  a 
Petri  net  equivalent  is  known.  Solution  of  the 
underlying  Markov  chain  is  performed  with 
Usenum,  a  package  for  solution  of  large  Markov 
chains,  also  developed  at  the  University  of 
Dortmund. 

ADAS  [4]  is  an  integrated  set  of  Petri 
net-based  tools  that  supports  the  development  of 
hierarchical  models.  The  tool  set  includes  :  a 
graph  editor  that  is  used  to  create  and  modify 
directed  graphs;  a  Petri  net  simulator  that 
verifies  the  correctness  of  a  software  directed 
graph  by  converting  it  into  a  Petri  net  and 
simulating  it;  a  Petri  Net  Analyzer  that  processes 
the  results  of  the  Petri  net  simulator  and  produces 
performance  analysis  reports;  a  high  level 
hardware  description  language  that  verifies  the 
correctness  of  the  hardware  graph  by  generating 
an HDL program and simulating it; and a software
functional simulator that supports the
development  of  either  C  or  Ada  modules  for 
modeling  the  software  operations  associated  with 
a  graph  node. 

Modeler  is  a  Stochastic  Timed 
Attributed  Petri  Net  (STAPN)-based  simulation 
tool  that  provides  a  GUI  based  environment  for 
model  development.  STAPN  is  an  extension  of 
Petri  nets  that  supports  branching,  time  delays, 
and  inhibitors  that  can  prevent  a  node  from  firing 
even  when  all  required  inputs  are  enabled. 
Modeler  as  presented  in  [35]  is  a  prototype 
model  that  has  limitations  on  the  number  of  input 
and  output  nodes,  a  limited  statistical  output,  and 
limited  ability  in  displaying  model 
characteristics. 

The System Architect’s Apprentice
(SARA)[66] is an environment for the analysis of
concurrent systems. The tool set provided with
SARA includes : a structure language (SL) that
provides a set of primitives for defining a
model’s structure in a nested, hierarchical
manner and that is responsible for managing
resources and ensuring that the interface between
modules remains consistent; a Module Interface
Description  (MID)  that  acts  as  a  support  tool  for 
SL  which  helps  establish  accessibility  of 
resources;  a  Graph  Model  of  Behavior  (GMB) 
that  provides  primitives  akin  to  Petri  nets  for 
specifying  and  analyzing  the  control  and  data 
flow  behavior  of  a  system;  and  a  model  library 
that  has  facilities  for  storage  and  retrieval  of 
model  components.  SARA  calculates 
performance parameters comprising mean
utilization,  mean  queue  size,  mean  waiting  time, 
queue  size  distributions,  and  confidence 
measures  for  all  modeled  resources. 


Miscellaneous  Modeling  Tools  : 

This  section  presents  some  system 
modeling  tools  that  do  not  use  the  classical 
techniques  of  modeling.  The  tools  presented  here 
are  based  on  formal  language  specifications, 
performance  process  algebras  and  other  recent 
modeling  techniques. 

SimPar[94]  is  a  modeling  environment 
for  the  performability  analysis  of  massively 
parallel  computer  systems.  Performability 
analysis  in  SimPar  comprises  both  performance 
and  dependability  analysis  by  considering  the 
performance  degradation  in  the  presence  of 
component  failures.  SimPar  uses  the  process- 
based  simulation  engine  and  the  error  injection 
capabilities of the DEPEND tool[176]. SimPar
uses a technique called conjoint simulation[90]
which  is  based  on  the  partitioning  of  the  system 
model  and  on  the  combination  of  various 
modeling  techniques.  A  so-called  architecture- 
workload  model  (AWM)  comprises  the 
architecture  and  workload  of  the  target  system 
and  relies  on  the  object-oriented  and  process- 
based  paradigms.  A  failure-repair  model  (FRM) 
represents  the  occurrence  of  component  failures 
and  control  of  fault-tolerance  and  maintenance 
mechanisms.  Performability  analysis  involves 
conjoint  simulation  of  the  AWM  and  FRM 
models. 

The  Parallel  Architecture  Research  and 
Evaluation  Tool  (PARET)[150]  is  an  interactive, 
animated  environment  for  analyzing 
multicomputer  systems.  Modeling  a  system  in 
PARET  comprises  developing  three  separate 
specifications : characterization of the
application software, characterization of system
functions, and characterization of the
interconnections. All specifications are modeled
as  directed  flow  graph  objects.  PARET  supports 
animation  and  interactive  monitoring  of  the 
simulation  of  the  model. 

Process  algebras  have  evolved  recently 
to  address  some  of  the  shortcomings  of  simple 
Petri  nets  for  behavioral  analysis  of  computer 
systems.  Formal  protocol  languages  and  system 
description  languages  that  have  often  been 
heavily  influenced  by  process  algebras  are  being 
investigated  as  a  means  for  providing 
performance  models  directly  from  system 
descriptions  and  specifications.  Early 
experiments  incorporating  such  an  approach 
include TIPP[88] from Universität
Erlangen-Nürnberg, and PEPA[84] from the University of
Edinburgh.  TIPP  was  an  attempt  at  a 
performance  modeling  extension  to  a  process 
algebra.  TIPP  is  similar  in  its  algebraic  notation 
to Milner’s Calculus of Communicating Systems
(CCS).  TIPP  has  demonstrated  that  the  notation 
could  express  models  outside  those  easily  dealt 
with  by  previous  performance  modeling 
formalisms  and  also  the  potential  for  solving 
such  models  by  numerical  techniques.  The 
PEPA (Performance Evaluation Process Algebra)
is  also  similar  to  CCS  in  its  algebraic  structure.  It 
adds  to  the  behavioral  analysis  capabilities  of 
CCS  by  being  able  to  generate  a  Markov  chain 
from  the  state  transition  model  underlying  the 
algebraic  description.  The  PEPA  workbench 
allows  models  written  in  PEPA  to  be  entered  and 
their underlying Markov chain to be generated in
a form suitable for processing by a backend
written using the Maple computer algebra
package.

A  number  of  groups  in  Europe  are 
experimenting  with  the  automatic  generation  of 
performance  models  from  formal  specifications. 
LOTOS  is  the  CCITT  recommended  protocol 
specification  language  based  on  process  algebra 
and  combining  the  features  of  CCS  and  Hoare’s 
Communicating  Sequential  Processes  (CSP). 
QNAP  models  have  been  generated  from 
LOTOS  specifications  as  a  part  of  the  ongoing 
ESPRIT  project  [213]. 

2.3  Measurement  Tools : 

The preferred method of evaluating
computer systems by measurement
is the use of benchmarks. A
benchmark  is  a  set  of  executable  instructions 
which  may  be  used  to  compare  the  relative 
performance  of  two  or  more  computer  systems.  A 
benchmark  is  usually  composed  of  computer 
programs,  but  may  also  include  scripts  of 
narrative  instructions  that  direct  a  person  or  a 
machine  to  perform  certain  specific  tasks  during 
the course of the comparison test.
Benchmarking is the process of conducting controlled
experiments to collect measures of system
performance that can be compared from one
system to another.

Numerous  benchmarking  suites  are 
available  for  most  commercial  computer  systems. 
Some  well  known  benchmarks  are  described  in 
[102].  An  overview  of  some  benchmarking 
techniques  is  presented  next. 

The Sieve kernel has been used to
compare  microprocessors,  personal  computers, 
and  high-level  languages.  It  is  based  on 
Eratosthenes’  sieve  algorithm  and  is  used  to  find 
all  prime  numbers  below  a  given  number  n. 
The  Ackermann  function  has  been  used  to 
assess  the  efficiency  of  the  procedure-calling 
mechanism  in  ALGOL-like  languages.  The 
average  execution  time  per  call,  the  number  of 
instructions  executed  per  call,  and  the  amount  of 
stack  space  required  for  each  call  are  used  to 
compare  various  systems.  The  Whetstone  suite 
exercises  such  processor  features  as  array 
addressing,  fixed-  and  floating-point  arithmetic, 
subroutine  calls  and  parameter  passing.  The 
LINPACK  suite  consists  of  a  number  of 
programs  that  solve  dense  systems  of  linear 
equations.  The  LINPACK  benchmarks  are 
compared  based  on  the  execution  rate  as 
measured in MFLOPS. The Dhrystone kernel
contains a number of procedure calls considered
to represent systems programming environments.
The benchmark is a measure of integer
performance; it does not exercise floating-point
or I/O processing. The SPEC (Systems
Performance Evaluation Cooperative) benchmark
suite stresses primarily the CPU, Floating
Point  Unit  (FPU),  and  the  memory  subsystem. 
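
The Sieve and Ackermann kernels described above can be illustrated with a short Python sketch. This is not the code of any published benchmark suite; the function names and the list-based call counter are assumptions made here for illustration only. The sketch reports the number of calls and the average execution time per call, two of the measures mentioned above.

```python
import time


def sieve(n):
    """Return all primes below n using Eratosthenes' sieve."""
    if n < 2:
        return []
    is_prime = [True] * n
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            # Cross out every multiple of i, starting at i * i.
            for j in range(i * i, n, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def ackermann(m, n, counter):
    """Ackermann function; counter[0] accumulates the number of calls."""
    counter[0] += 1
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1, counter)
    return ackermann(m - 1, ackermann(m, n - 1, counter), counter)


def time_per_call(m, n):
    """Return (value, number of calls, average seconds per call)."""
    counter = [0]
    start = time.perf_counter()
    value = ackermann(m, n, counter)
    elapsed = time.perf_counter() - start
    return value, counter[0], elapsed / counter[0]
```

Only small arguments such as (2, 3) are practical here, since the recursion depth of the Ackermann function grows very quickly.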

Measurement  of  a  computer  system  can 
be  performed  by  hardware  or  software  monitors. 
Some  hardware  systems  offer  facilities  that  can 
be  used  for  analyzing  performance  parameters, 
like  special  counters  for  recording  events.  The 
information  can  then  be  read  by  the  monitoring 
software  and  processed.  When  dealing  with 
events  on  a  bus  or  network  link,  special  hardware 
is required. In micro-coded architectures,
monitoring  facilities  could  be  provided  at  that 
level for event capture. Software monitoring can
be provided at many levels: at low levels, recording
activities such as disk accesses; at
intermediate levels, recording operating system calls;
or at high levels, recording application-level
activity such as database requests. Mapping
requests  at  one  level  onto  requests  at  another  is  a 
difficult activity at best. In the following
paragraphs, we survey some tools that are
specifically  oriented  towards  monitoring 
hardware  level  performance. 

The  Test  and  Measurement  Processor 
(TMP)[220]  is  a  multicomputer  monitoring 
facility that monitors the behavior of an
MC68000-based distributed system. TMP
consists  of  a  host  test  station  and  a  set  of  local 
monitors,  all  interconnected  via  a  monitoring 
network. Each local monitor contains an
MC68000, an I/O unit for output to a terminal or
printer,  a  network  interface  unit,  and  an  event 
processing  unit.  The  local  monitors  observe  and 
record  the  bus  traffic  of  the  local  processor  and 
produce  performance  summaries.  Summaries, 
which  can  include  the  number  of  messages 
transmitted  and  received,  elapsed  time,  execution 
times,  idle  times,  etc.  can  be  sent  to  the  host  test 
station  to  be  displayed.  The  TMP  can  monitor 
and  produce  performance  summaries  at  all  levels 
of  the  hierarchy  (system,  process  and  module). 

The  Vax  8800[47]  Monitor  is  a 
hardware  monitor  that  collects  data  on  the  8800 
processor’s  program  counter  and  memory  bus 
status. The Vax 8800 monitor comprises two
modules: a histogram module and a Digital
DMF-32 synchronous parallel interface module.
The histogram module is responsible for
maintaining a count of all machine cycles
executed within the 8800 processor. The
histogram  module  can  also  keep  track  of  stalled 
cycles  and  the  status  of  the  memory/IO  bus  at 
each  clock  cycle.  The  DMF-32  module  provides 
an  interface  to  the  histogram  module  for 
initialization  control  and  downloading  of 
histogram  data.  The  following  performance 
parameters  can  be  determined  based  on  the 
collected  data  :  opcode  execution  frequencies, 
operand  specifier  frequency  distributions, 
frequency  of  reads  and  writes  per  instruction, 
frequency  of  events  on  the  memory/IO  bus,  read 
and  write  hit  ratios,  and  stalled  cycles  per 
microinstruction. 

Zahlmonitor  4  [61]  is  a  measurement 
environment  for  monitoring  multiprocessor 
systems.  It  includes  a  set  of  hardware  probe  units 
for  the  object  system  components  and  a  PC 
attached to each probe unit. The PCs are
connected  to  a  central  control  and  evaluator 
station.  A  global  time  base  is  maintained  via 
tightly  coupled  clocks  synchronized  by  hardware. 

Sterling  et  al.  describe  a  hardware 
monitor  (the  Degradation  due  to  Latency  and 
Arbitration  (DLA)  device)  for  the  CONCERT 
multiprocessor  in  [204].  Several  sources  of 
performance degradation are identified, namely
(1)  insufficient  parallelism  in  the  application,  (2) 
contention  for  shared  resources,  (3)  overhead 
imposed  by  partitioning  the  problem,  and  (4) 
latency  of  access  to  objects.  DLA  was  designed 
to  measure  the  effect  of  contention  and  latency  at 
the  hardware  level.  The  DLA  monitor  was 
capable  of  accumulating  statistics  in  real  time 
concerning  bus  utilization,  bus  requester  wait 
time,  memory  access  latency,  and  contention  for 
software  level  semaphores. 

REMS  (Resource  Measurement  System) 
[43]  is  a  tool  to  aid  in  the  analysis  and 
measurement  of  hardware  performance  for 
shared  bus  multiprocessors.  Events  of  interest 
include  low  level  hardware  activities  such  as 
memory  access,  cache  access,  I/O  operations, 
and  queueing  for  shared  hardware  resources. 
REMS  is  composed  of  a  set  of  sample  units 
connected  to  an  analysis  subsystem.  The  sample 
units  compare  the  state  of  a  set  of  signal  lines 
(connected to the system under test) to a set of
patterns. Pattern-matching hardware allows for
fast comparison of an incoming pattern with a set
of  patterns  of  interest.  A  pattern  match  may 
initiate  other  recording  activities  including 
counter  sampling  etc. 
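
The comparison of signal-line state against patterns of interest can be sketched in software as a mask-and-value test. The following Python sketch is only an analogy to the REMS hardware: the encoding of a pattern as a (mask, value) pair and the names used are assumptions made here, not details taken from [43].

```python
def matches(signals, mask, value):
    """True when the signal bits selected by mask equal the pattern value."""
    return (signals & mask) == (value & mask)


def sample(signal_stream, patterns):
    """Count, per pattern name, how many sampled signal words matched.

    patterns maps a name to a (mask, value) pair; both are hypothetical
    encodings of the "patterns of interest" described in the text.
    """
    counts = {name: 0 for name in patterns}
    for signals in signal_stream:
        for name, (mask, value) in patterns.items():
            if matches(signals, mask, value):
                counts[name] += 1
    return counts
```

A mask lets one pattern ignore "don't care" lines, so a single pattern can cover a whole class of events.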

TRAMS[43] (TRAce Measurement
System) has been developed at the National
Bureau of Standards. Events of interest are
marked by writing to a location in the address
space of each process. The data written to the address is
then  recorded  by  the  measurement  hardware 
along  with  a  32  bit  time  stamp,  the  processor 
number,  and  the  execution  mode  (user  or  system) 
of  the  processor. 
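
A trace record of the kind described above might be laid out as in the following Python sketch. Only the 32 bit time stamp is stated in the text; the one-byte processor number, the one-byte mode flag, the 32 bit data word, and the field order are assumptions made here for illustration.

```python
import struct

# Big-endian, no padding: 32-bit time stamp, 8-bit processor number,
# 8-bit execution mode, 32-bit data word (all widths except the
# time stamp are hypothetical).
RECORD = struct.Struct(">IBBI")
USER, SYSTEM = 0, 1


def pack_event(timestamp, processor, mode, data):
    """Encode one event; the time stamp wraps around at 2**32."""
    return RECORD.pack(timestamp & 0xFFFFFFFF, processor, mode,
                       data & 0xFFFFFFFF)


def unpack_event(raw):
    """Decode one event record into a dictionary."""
    timestamp, processor, mode, data = RECORD.unpack(raw)
    return {"timestamp": timestamp, "processor": processor,
            "mode": mode, "data": data}
```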

ATUM  (Address  Tracing  Using 
Microcode)[193]  collects  traces  of  addresses 
issued  from  every  instruction  executed  by  a  VAX 
8350  multiprocessor.  ATUM  is  composed 
entirely  of  microcode  that  augments  the  standard 
8350  microcode.  As  each  memory  request  is 
issued  by  the  processor,  ATUM  writes  a  record 
of  the  request,  including  the  virtual  address  and 
the  type  of  access,  to  a  block  of  memory  reserved 
for  ATUM  use.  Traces  from  ATUM  experiments 
have  been  used  in  studies  of  cache  performance 
and  to  support  cache  models  and  other 
performance  models  that  rely  on  memory 
reference  patterns. 
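
To illustrate how address traces of this kind support cache studies, the following Python sketch replays a trace through a direct-mapped cache and reports the hit ratio. The trace format (a list of address and access-type pairs) and the cache organization are assumptions made here; they are not taken from the ATUM studies.

```python
def simulate_direct_mapped(trace, num_lines, line_size):
    """Replay (address, access_type) records and return the hit ratio.

    The cache is direct-mapped with num_lines lines of line_size bytes.
    The access type is carried in the trace but not used by this model.
    """
    tags = [None] * num_lines
    hits = 0
    for address, _access_type in trace:
        block = address // line_size
        index = block % num_lines
        tag = block // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            # Miss: the referenced block replaces the resident one.
            tags[index] = tag
    return hits / len(trace) if trace else 0.0
```

The same trace can be replayed with different values of num_lines and line_size, which is the main attraction of trace-driven studies.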

3  PROCESS  LEVEL  ANALYSIS  : 

This  section  surveys  tools  and 
techniques  that  have  been  developed  specifically 
to  analyze  and  predict  the  performance  of  a 
computer  system  at  the  process  level.  It  should 
be  noted  that  a  number  of  the  system  level  tools 
discussed  in  previous  sections  can  also  be  used 
for  this  purpose.  A  number  of  these  tools  use 
analytical  techniques  to  obtain  a  rough  estimate 
of  the  performance  parameters  and  simulation 
techniques  during  advanced  stages  of  the 
prediction  process. 

3.1  Modeling  Tools  and  Techniques  : 

Modeling  tools  (analytic  and  simulation 
based)  and  techniques  have  been  developed  to 
predict  the  performance  of  software  systems. 
There  has  been  extensive  work  done  in 
performance  prediction  based  on  statistical  and 
probability  theory  methods.  Parallel  programs 
can  be  modeled  in  terms  of  distribution 
functions,  random  variables,  regression  models, 
stochastic processes, Markov processes and
chains, queuing networks, Petri nets etc., and
performance parameters can be obtained.
F. Sotz[198] provides an approximation
technique  to  estimate  the  runtime  of  a  parallel 
program  which  is  modeled  as  a  stochastic  graph. 
Nodes in the graph represent tasks. The runtimes
of the tasks can be deterministic or
exponentially distributed. The technique
is  based  on  transient  state  space  analysis.  N. 
Yazici-Pekergin and J.M. Vincent obtain
stochastic bounds on execution times of parallel
programs assuming the availability of an
unlimited number of processors. The execution
times of parallel tasks are identically distributed
random variables. F. Hartleb and V.
Mertsiotakis[92]  derive  upper  and  lower  bounds 
for  parallel  programs  as  a  means  to  experiment 
with  mapping  and  implementation  alternatives. 
Parallel  programs  are  modeled  as  a  stochastic 
graph  and  the  runtime  behavior  of  a  specific 
processor  is  described  by  a  random  variable. 
Simulation  techniques  such  as  emulation,  Monte 
Carlo,  trace-driven  and  discrete-event  simulation 
can  be  used  to  evaluate  the  performance  of 
program models (graph, queuing models, Petri
net models etc.). J. Prost and S. Kipnis[166]
describe  a  multi-level  trace-driven  simulation 
approach  in  order  to  analyze  the  performance  of 
programs  for  distributed  memory  parallel 
systems.  The  trace  consists  of  a  sequence  of 
events  to  be  simulated.  A  parameterized  model  of 
the  target  architecture  is  incorporated  and  four 
hierarchical  simulation  levels  allow  the  user  to 
examine  performance  parameters  at  the  user, 
library and communication levels. J. Bruner et al.
[38]  create  instrumented  profile  runs  of  a  parallel 
program,  which  serve  as  input  to  an  event-driven 
simulator.  This  approach  is  aimed  at  determining 
the  maximum  available  parallelism  in  a  program. 
A  Computer  Architecture  Research  Language 
(CARL)  is  used  to  model  the  underlying 
architecture.  H.  Mierendorff  et  al.[136]  evaluate 
the  performance  of  parallel  programs  on 
distributed  memory  multi-processor  systems. 
They introduce an analytical approach
considering message routing, algorithm structure
and  data  mapping.  The  tool  developed  can  model 
large  systems  both  in  terms  of  architecture  and 
algorithms. 
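
The idea of a stochastic graph model can be illustrated with a small Monte Carlo sketch in Python. It is not the transient state space technique of [198] nor the bounding methods cited above; it simply samples exponentially distributed task runtimes and averages the length of the longest path. The graph encoding (each task mapped to its predecessors) and all names are assumptions made here.

```python
import random


def runtime(graph, durations):
    """Completion time of a task graph given one duration per task.

    graph maps each task to the list of tasks it must wait for,
    so the result is the length of the longest path through the DAG.
    """
    finish = {}

    def finish_time(task):
        if task not in finish:
            start = max((finish_time(p) for p in graph[task]), default=0.0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(finish_time(t) for t in graph)


def mean_runtime(graph, mean_durations, samples=10000, seed=0):
    """Estimate the mean runtime with exponentially distributed tasks."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        drawn = {t: rng.expovariate(1.0 / m)
                 for t, m in mean_durations.items()}
        total += runtime(graph, drawn)
    return total / samples
```

With deterministic durations, runtime gives the critical path directly. With exponential durations the estimated mean is larger, because the expected maximum over parallel branches exceeds the maximum of their means.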

Benchmarking, long used for
performance measurement at the system level, is
now a popular performance prediction approach
to support the optimization effort at the process
level. V. Sarkar[180] describes a general
framework for determining average program
execution times in the PTRAN project by using
frequency information and pre-measured
execution times of primitive operations.
Balasundaram et al.[23] describe a performance
estimator to select a data distribution strategy
based on runtime information. It is limited to
programs utilizing the loosely synchronous
communication model. A set of kernels for
operations on a single processor, and loosely
synchronous collective communication routines
on a parallel architecture are incorporated to train
the estimator. A parallel program is parsed for
detection of pre-measured kernels. The estimated
runtime of this program is derived as the
accumulated time of all kernels. N.
MacDonald[124] estimates the performance of a
subset of Fortran 77 programs using analytical
time  formulae,  considering  only  primitive  control 
flow.  Benchmarks  are  used  to  pre-measure 
primitive  code  kernels  and  these  pre-measured 
times  are  used  as  parameters  in  the  analytical 
time  formulae. 
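
The common idea behind these approaches, combining execution frequencies with pre-measured times of primitive operations or kernels, can be sketched as a weighted sum. The Python sketch below is a deliberate simplification and does not reproduce any of the cited estimators; the kernel names and timings in the usage note are hypothetical.

```python
def estimate_runtime(kernel_counts, kernel_times):
    """Accumulate pre-measured kernel times weighted by frequency.

    kernel_counts maps a kernel name to how often it executes;
    kernel_times maps a kernel name to its pre-measured time in seconds.
    """
    missing = set(kernel_counts) - set(kernel_times)
    if missing:
        raise KeyError(f"no pre-measured time for: {sorted(missing)}")
    return sum(count * kernel_times[name]
               for name, count in kernel_counts.items())
```

For example, with a hypothetical 2 microsecond "add" kernel executed 1000 times and a 0.5 millisecond "send" kernel executed 4 times, the estimate is about 4 milliseconds.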

W. Abu-Sufah and A.Y. Kwok[1]
present  a  set  of  performance  prediction  tools 
developed  for  the  Cedar  multi-processor  system. 
Their  approach  involves  analytical  and 
simulation  techniques  incorporating  guessing  for 
unknown parameters. Analysts can choose
either technique depending on the
accuracy required (analytic for coarse grain and
simulation for fine grain).

D.  Atapattu  and  D.  Gannon  [19]  obtain 
estimated  runtimes  for  parallel  FORTRAN 
programs  in  order  to  support  program 
transformation.  An  analytical  model  of  the  bus 
behavior  for  the  Alliant  FX/8  is  incorporated 
assuming  exponentially  distributed  processes  and 
a  queuing  model.  Their  estimates  are  algebraic 
expressions  of  unknown  loop  bounds  and  number 
of  processors. 

Modarch[225]  is  an  environment 
dedicated  to  performance  evaluation  of 
distributed  computing  systems.  Modarch  helps 
find  the  best  fit  between  hardware  configuration 
and  software  applications.  Modarch  is  built  upon 
the  Modline  environment  and  uses  a  dedicated 
version  of  the  QNAP2  simulation  software. 
Modarch  features  a  graphical  programming 
interface  that  allows  the  analyst  to  specify  the 
software  and  hardware  architecture.  The  tools 
included in the Modarch environment are: (1)
The Experimenter, which automatically generates
simulation runs of a model. The Experimenter
runs the model for each possible value of the
input parameters and stores output results. (2)
The Analyzer, which allows interactive extraction of
output data and visualization as graphs: lines,
bars, pie charts etc. (3) The Reporter, which generates
and  compiles  all  the  information  relevant  to  the 
report  subject  within  the  study. 
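
The role of the Experimenter, running a model for every combination of input parameter values and storing the results, can be sketched as a parameter sweep. The Python sketch below is not Modarch or QNAP2 code. As a stand-in for a simulation run it uses the closed-form M/M/1 mean response time 1/(mu - lambda); all names are assumptions made here.

```python
import itertools


def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue (stand-in for a simulation)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival_rate >= service_rate")
    return 1.0 / (service_rate - arrival_rate)


def run_experiments(model, parameter_values):
    """Run model once per combination of the listed parameter values.

    Returns the parameter names (sorted) and a dictionary that maps each
    combination of values, in that order, to the model output.
    """
    names = sorted(parameter_values)
    results = {}
    for combo in itertools.product(*(parameter_values[n] for n in names)):
        settings = dict(zip(names, combo))
        results[combo] = model(**settings)
    return names, results
```

The stored results are then available for extraction and plotting, which corresponds to the Analyzer step.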

AIMS  (Automated  Instrumentation  and 
Monitoring  System)[226]  is  an  ongoing  effort  at 
NASA.  AIMS  consists  of  a  suite  of  software 
tools  for  measurement  and  analysis  of 
performance. Our area of interest is the modeling
facility provided with the AIMS environment.
The Modeling Kernel (MK) is a facility in AIMS
for modeling parallel programs. MK supports
simulation-based and analytical approaches to
performance prediction and scalability analysis,
and automates the process of building and simulating
parallel-program models. Based on such models,
users  can  obtain  asymptotic  performance 
characteristics  for  either  the  entire  program 
(process  level)  or  individual  components 
(module  level).  The  main  component  of  MK  is 
GPPM  (Generator  of  Parallel-Program  Models). 
GPPM models parallel programs at the coarsest
level, capturing only the duration of sequential
blocks, the lengths and destinations of messages,
loop bounds and conditional branch
probabilities. All references to I/O and memory
are  ignored.  The  model  structure  mirrors 
program  structure  and  is  derived  from  parse 
trees,  one  per  FORTRAN  or  C  module. 

Axe[221]  is  an  integrated  set  of  tools 
for  the  analysis  of  algorithms  and  partitioning 
strategies  on  mesh-connected  concurrent 
processors.  The  tool  set  includes  a 
compiler/translator,  a  simulator,  a  monitor  and  an 
experimentation  executive  for  processing  user 
generated  commands.  The  user  specifies  the 
program  to  be  analyzed  using  a  behavior 
description  language.  The  behavior  description 
language allows the user to specify the model
using constructs similar to those of the application
implementation language, except that the
execution  time  of  the  statements  is  simulated.  In 
addition  to  providing  a  program  description,  the 
user  has  the  ability  to  define  characteristics  of  the 
run-time  environment,  including  message  I/O 
overhead,  process  creation  overhead, 
communication  link  bandwidths,  number  of 
nodes  and  the  amount  of  memory  per  node.  The 
user  can  also  select  from  a  limited  set  of  built-in 
topologies,  routing  algorithms,  partitioning 
algorithms,  and  scheduling  algorithms. 
Performance  data  is  generated  by  discrete-event 
simulation  of  the  model. 

The  Network  Emulation  Tool 
(NET)[20]  is  a  computer  network  simulation  that 
supports  the  analysis  of  distributed  operating 
systems  and  distributed  databases.  NET  provides 
a  default  network  description  which  can  be 
altered  by  the  user  to  represent  a  limited  set  of 
networks.  Several  network  parameters  can  be 
user  defined  including  message  delay,  message 
loss,  message  duplication,  node  failure  rate,  and 
network partitioning. A user-defined network
description can be created when the default
description is inadequate. NET generates
statistical  output  on  network  performance  as  well 
as  algorithm  performance. 

The  Rice  Parallel  Processing  testbed 
[54]  is  an  execution-driven  environment  for  the 
analysis  of  concurrent  programs.  Actual 
workloads  are  executed  on  the  target  computer  to 
obtain  realistic  processing  delays  while  all 
interprocess  communications  and  interactions  are 
simulated  allowing  for  a  variety  of  architectures 
to  be  evaluated.  The  user  must  provide  three 
types  of  input  :  a  concurrent  program,  a 
simulation  model  of  the  architecture,  and  a 
process-to-hardware mapping. The environment
comprises: (1) Concurrent C - This version of
C supports parallel programming. The program
to be evaluated has to be written in this language.
(2) C Simulation Package - This package is a
discrete-event  simulator  for  event  queue 
manipulation,  data  collection,  and  tracing.  (3) 
Architecture  Simulation  Preprocessor  -  The 
preprocessor  inserts  simulation  primitives  into  a 
Concurrent  C  program.  These  primitives 
represent  inter-process  communication  and 
synchronization  delays.  (4)  Timing  Profiler  -  The 
profiler  is  an  assembly  language  analyzer  that 
estimates  execution  time  of  sequential  code 
segments.  (5)  Simulation  Tool  Interface  -  The 
user  interface  is  menu  driven  and  supports 
windowing.  (6)  Parallel  Tracer/Debugger  -  This 
tool  supports  a  windows-oriented  user  interface 
for  monitoring  and  controlling  model  execution. 
(7)  Library  -  A  library  is  provided  for  storing 
concurrent  C  programs  and  architecture  models. 

QASE[227]  is  an  analytic  and 
simulation  modeling  tool  for  distributed 
client/server  applications.  QASE’s  system 
description  is  a  hierarchical  entity-attribute 
specification.  Entities  in  a  system  include 
execution  flow  diagrams,  workloads  (periodic  or 
random),  hardware  diagrams  (processor,  storage 
and  communication  architectures),  software,  data 
(data  stores  and  flows),  operating  systems, 
communication protocols, and allocations. QASE
uses multiple evaluation techniques: the
analytic approach to evaluate the feasibility of
alternate system descriptions, and discrete-event
simulation for detailed analysis during the final
stages of design. QASE also supports automatic
model  generation  by  populating  the  model  with 
performance  metric  data  collected  using  HP’s 
MeasureWare  Agent. 

The Vienna FORTRAN Compiler
System (VFCS)[45] has a parameter-based
performance prediction tool in its tool kit, which
predicts the performance of
FORTRAN programs. Workload parameters including work
distribution,  number  of  data  transfers,  transfer 
times,  network  contention,  cache  miss  ratio,  and 
main  memory  performance  can  be  modeled 
analytically.  The  parameters  are  modeled  and 
expressions  are  derived  for  statements,  loops, 
procedures  and  the  entire  program.  The  flow 
variables  (control  and  data)  are  not  guessed 
(specified  by  the  designer)  but  are  estimated 
through  profile  runs  of  the  code.  Current  work  is 
focused  on  training  the  tool  by  running  different 
program  profiles  under  different  workload 
conditions. 

PEPP  (Performance  Evaluation  of 
Parallel  Programs)[59]  is  a  modeling  tool  for 
creating  and  evaluating  stochastic  graph  models 
of  parallel  and  distributed  programs.  PEPP  offers 
functions  for  graphical  model  creation  and 
various  evaluation  methods  for  calculating  the 
mean  runtime  of  a  program.  PEPP  supports  the 
idea of model-driven monitoring, where
modeling  and  monitoring  are  integrated  into  a 
framework  to  support  easier  evaluation,  tuning 
and  debugging  of  parallel  and  distributed 
systems.  PEPP  is  implemented  in  C  with  the 
graphical  user  interface  implemented  on  top  of 
the  X  Windows  system. 

MENTOR  (Model  based  EveNT  Trace 
analysis  suppORt  system)[60]  is  an  expert  system 
which  assists  in  the  trace  evaluation  of  parallel 
and  distributed  programs  by  incorporating 
knowledge  about  the  program  under  investigation 
into the trace analysis environment SIMPLE[107].
The  knowledge  is  derived  from  stochastic  graph 
models  created  with  PEPP. 

3.2  Measurement  Tools  : 

Process  level  measurement  tools  require 
instrumentation  of  the  relevant  code  (process 
code  or  operating  system  code)  for  which 
performance  data  is  to  be  obtained.  One  of  the 
major  concerns  facing  designers  of  such  tools  is 
the  perturbation  introduced  in  the  performance 
data  obtained  as  a  result  of  the  measurement 
process.  Instrumentation  of  the  code  has  to  be  as 
non-intrusive  as  possible  so  as  to  obtain  accurate 
results.  The  following  sections  survey  some 
measurement  tools  designed  specifically  for  the 
process  level. 


Process  Level  Measurement  Tools 

The Berkeley UNIX Monitor[137] is a
software  monitor  within  the  kernel  for  measuring 
the  performance  of  a  distributed  program.  The 
monitor  is  a  distributed  program  capable  of 
executing  its  functions  on  a  user  specified 
processor. The monitor provides four functions:

• Meter - detects and records events within the
kernel  so  as  to  produce  a  trace.  Trace  data 
can  include  creation/destruction  of 
processes,  starting/stopping  of  processes, 
and  inter-process  communication. 

•  Filter  -  extracts  user  specified  trace 
information  from  the  trace  data  generated. 

•  Control  -  provides  an  interface  to  the  user  to 
control  the  measurement  process. 

•  Analysis  -  analysis  routines  can  be  defined 
by  the  user  to  summarize  and  report  on  the 
filtered  traces. 
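
The division of labor between metering, filtering and analysis can be sketched as a small pipeline. The Python sketch below is not the Berkeley monitor itself; the event fields, class name and function names are assumptions made here for illustration, and the Control function is omitted.

```python
from collections import Counter


class Meter:
    """Records events in the order they occur, producing a trace."""

    def __init__(self):
        self.trace = []

    def record(self, kind, pid, **details):
        self.trace.append({"kind": kind, "pid": pid, **details})


def filter_trace(trace, kinds=None, pids=None):
    """Keep only the events the user asked for (None means no restriction)."""
    return [event for event in trace
            if (kinds is None or event["kind"] in kinds)
            and (pids is None or event["pid"] in pids)]


def count_by_kind(trace):
    """A user-defined analysis routine: summarize events per kind."""
    return Counter(event["kind"] for event in trace)
```

Keeping the filter separate from the meter means that the analysis routines only see the part of the trace they need.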

The UNIX gprof utility[89] introduces
the concept of generating a dynamic call graph
for an execution of a program. The dynamic call
graph  contains  one  node  for  each  routine  that  is 
invoked  as  the  program  executes.  Each  directed 
arc  in  the  graph  connects  a  caller  with  a  callee. 
The  gprof  routines  build  the  dynamic  call  graph 
from  a  program  run.  At  compile  time,  calls  to  an 
event  recording  routine  are  inserted  at  the  entry 
to  each  subroutine.  When  the  subroutine  is 
called,  an  arc  between  the  caller  and  the  callee  is 
recorded  in  a  table.  The  graph  is  generated  by  a 
post  processor  after  termination  of  the  process. 

Monit[111] is a performance monitor
for  the  Sequent  Balance  8000  system.  Events  of 
interest  included  task  and  process  creation  and 
termination,  entry  to  and  exit  from  resource 
queues,  and  a  general  “value  trace”  event.  Active 
recorders  log  event  occurrences  to  a  buffer  in 
memory.  A  separate  process  is  responsible  for 
transferring  the  buffer  contents  to  permanent 
storage. 

Radar[119] is a debugging tool to assist
in  analysis  of  distributed  applications.  The 
applications  execute  on  a  network  of  PERQ 
workstations.  The  events  of  interest  for  the 
developers  of  Radar  included  process  creation 
and  termination,  message  transmission  and 
reception,  port  creation  and  a  general  purpose 
event.  Events  are  recorded  by  the  node  in  which 
they  occur.  Each  event  is  marked  with  an  event 
number  as  there  is  no  concept  of  a  global 
synchronized time. The event recorder copies the
content of each message as a part of the event
record. This feature facilitates the replay of the
entire  experiment  in  “single  step”  mode. 

PCA  (Performance  and  Coverage 
Analyzer)[62]  is  a  performance  measurement 
tool  designed  for  the  VAX  architecture.  PCA  has 
been  used  for  both  uniprocessor  and 
multiprocessor  applications.  Measurement 
experiments  are  divided  into  two  phases:  the 
collection  phase  and  the  analysis  phase.  The 
collection  phase  involves  sampling  the  program 
counter  at  intervals  determined  by  the  system 
timer  (ten  milliseconds).  Histograms  of  program 
activity  by  subroutine,  or  even  by  line  of  source 
code  can  be  generated.  The  time  cost  of  work 
done  by  the  low  level  subroutines  can  be 
propagated  back  to  statements  within  the  higher 
level  subroutines  when  the  call  stack  information 
is accumulated. PCA allows the insertion of
software trace markers that collect other statistics,
namely (1) the number of invocations of each
selected subroutine or code fragment, (2) the
number of page faults incurred by each module,
routine or line of code, and (3) the frequency of requests
for  system  services  by  location  in  the  application. 
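
The collection phase described above, sampling the program counter at fixed intervals and building a histogram by subroutine, can be sketched as follows. The Python sketch is not PCA code: the samples are supplied as a list rather than taken by a timer, and the symbol table format (start address mapped to routine name) is an assumption made here. The ten millisecond interval is the one quoted in the text.

```python
import bisect
from collections import Counter


def histogram(pc_samples, symbol_table, interval_ms=10):
    """Attribute each program counter sample to a routine.

    symbol_table maps a routine's start address to its name; a sample
    belongs to the routine with the greatest start address not above it.
    Returns the estimated time in milliseconds per routine.
    """
    starts = sorted(symbol_table)
    time_ms = Counter()
    for pc in pc_samples:
        pos = bisect.bisect_right(starts, pc) - 1
        name = symbol_table[starts[pos]] if pos >= 0 else "<unknown>"
        time_ms[name] += interval_ms
    return time_ms
```

Because the method is statistical, routines that run for much less than one sampling interval may not appear in the histogram at all.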

The FORTRAN Analyzer[123] is a
syntax-driven software tool that inserts monitoring
code  into  an  American  National  Standard 
FORTRAN  program.  Parameters  passed  to  the 
monitoring  routine  include  the  segment  being 
monitored  and  the  monitoring  routine’s  entry 
point.  A  code  segment  is  enclosed  between  the 
entry  and  exit  points.  Thus  this  tool  can  be  used 
to  monitor  performance  at  the  module  level  by 
appropriate  instrumentation.  The  storage 
requirements  for  instrumented  programs  may 
increase  by  26  to  55%. 

Parasight[18]  is  an  environment  for 
performance  analysis  of  sequential  and  parallel 
programs.  The  platform  for  Parasight  is  UNIX  on 
the  Encore  Multimax  which  is  a  shared-memory 
multiprocessor  system.  Parasight  is  executed 
concurrently  with  the  program  to  be  monitored 
which  is  embedded  within  the  Parasight 
environment. Upon startup, Parasight
initializes a multitasking environment. The
monitored  program  is  loaded  into  this 
environment  and  a  memory  resident  symbol  table 
is  created.  The  code  is  executed  concurrently 
with  Parasight  programs  that  monitor  shared 
memory. Parasight routines can be offloaded to
processors  other  than  the  one  being  used  by  the 
monitored  code  to  reduce  interference.  Parasight 
provides  breakpoints  that  can  be  created  and 
deleted  dynamically  at  run  time. 


The  Parallel  Software  Environment 
(PSE)[228]  is  a  performance  analysis  product 
from  DEC  that  includes  a  loop-capable  and 
parallel  profiler  for  high  performance 
FORTRAN.  The  profiler  provides  information 
about  the  time  spent  in  logical  sections  of  the 
code  such  as  do-loops.  The  profiler  allows 
programmers  to  view  program-unit  and 
statement-level  timing  information  about  parallel 
execution.  The  performance  information  also 
includes communication times associated with
individual  FORTRAN  statements. 

The  Programming  and  Instrumentation 
Environment  (PIE)[120]  is  a  framework  for 
developing  techniques  to  predict,  detect  and 
avoid  performance  degradation  in  parallel  and 
distributed  programs  in  a  shared  memory 
multiprocessor  environment.  PIE  supports  the 
analysis  of  parallel  process  composition, 
communications,  and  data  partitioning.  PIE  is 
implemented on top of the Mach kernel. PIE
provides  a  customized  visual  editing  system 
through  which  the  user  identifies  the  principal 
programming constructs. PIE provides a
meta-language to support the development of parallel
algorithms  for  observation  and  analysis.  The 
meta-language  is  used  in  conjunction  with  Pascal 
and  extends  its  capabilities  by  providing  parallel 
functionality  such  as  synchronization,  access  to 
shared  data,  etc.  After  the  source  code 
visualization,  PIE  allows  for  automatic 
observation  of  constructs  within  the  code.  PIE’s 
instrumentation  is  currently  done  using  software 
instrumentation  techniques.  An  Implementation 
Assistant  tool  provides  semantic  support  for 
parallel  program  development.  The  tool  helps 
predict  program  performance  before 
implementation  and  assists  the  user  in  selecting  a 
parallel  implementation.  PIE’s  visualization 
utilities  include  histograms  and  time-lines. 

The  JADE[220]  programming  system  is 
a  distributed  monitoring  facility  consisting  of  two 
parts  :  data  detection  and  collection,  done  by  so- 
called  channel  processes,  and  data  analysis  and 
presentation,  done  by  so-called  consoles.  JADE 
extends  debugging  support  to  distributed 
applications  based  on  inter-process 
communication. 

The  INCAS  [220]  project  at  the 
University  of  Kaiserslautern  has  developed  a  tool 
for  measuring  the  performance  and  observing  the 
behavior  of  distributed  systems  during  execution. 
A  hardware  support  module,  called  Test  and 
Measurement  Processor  (TMP)  is  integrated  into 
each node of a distributed system. All TMPs are
connected to a central monitoring station via a
measurement  LAN.  Sensor  code  in  the  monitored 
system  is  reduced  to  single  store  instructions  for 
event  signaling,  leading  to  very  low  interference. 

Sun  Microsystems  provides 
SPARCworks[229],  a  tool  to  support  dynamic 
analysis  and  control  of  multi-threaded  programs. 
SPARCworks  supports  analysis  of  the  code  for 
potential  synchronization  errors  such  as 
deadlocks  and  data  race  conditions.  Detailed 
thread  level  profiling  is  also  supported. 

JEWEL[117]  is  another  distributed 
measurement  environment  that  consists  of  four 
functional  blocks : 

•  the  system  under  test  (SUT), 

•  the  data  collection  and  reduction  system 
(DCRS), 

•  the  graphical  presentation  system  (GPS), 

•  and  the  experiment  control  system  (ECS). 
Measurement  data  is  extracted  from  the  SUT, 
collected  and  filtered  by  the  DCRS,  and  then 
passed  to  the  GPS  for  visualization  to  the 
experimenter  concurrent  with  the  operation  of  the 
SUT. Interpreting the visualized data may result
in actions, e.g. customizing the graphical
appearance or taking a snapshot, or in control
requests issued to the ECS, e.g. to change the
level of detail, to stop the current experiment,
or to set up a new configuration.

SPY[214]  is  a  software  monitor  that 
does  periodic  location  counter  sampling  to 
determine  the  performance  of  an  application 
program.  Function  calls  provided  to  the  user 
include  a  setup  call  that  initializes  SPY,  an 
activate  monitor  call  that  turns  on  monitoring, 
and  a  terminate  monitor  call  which  turns  off 
monitoring. The setup call requires that the user
specify  a  histogram  array  name,  array  address, 
and  sampling  interval.  All  measurement  data  is 
stored  within  the  histogram  array  in  the  address 
space  of  the  program. 
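
The sampling scheme described above can be illustrated with a minimal Python sketch. It is not SPY's interface: the function name `sample_histogram`, the synthetic trace of location-counter values and the bucket size are all assumptions made for illustration.

```python
from collections import Counter

def sample_histogram(pc_trace, interval, bucket_size):
    """Sample a trace of location-counter values every `interval` steps
    and count the samples that fall in each address bucket."""
    histogram = Counter()
    for step in range(0, len(pc_trace), interval):
        bucket = pc_trace[step] // bucket_size
        histogram[bucket] += 1
    return histogram

# Synthetic trace (hypothetical): 90 steps in a hot loop at addresses
# 100..109, followed by 10 steps of cleanup code at address 500.
trace = [100 + (i % 10) for i in range(90)] + [500] * 10
hist = sample_histogram(trace, interval=5, bucket_size=100)
for bucket, count in sorted(hist.items()):
    print(f"addresses {bucket * 100}-{bucket * 100 + 99}: {count} samples")
```

The bucket that collects the most samples points at the region of code where the program spends most of its time, which is the information the histogram array is meant to provide.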

TX-2[148]  is  a  time-shared  system  that 
provides  a  hardware  monitor  for  measuring 
program  performance.  The  monitor  has  access  to 
the  program  counter  and  index  registers.  The 
monitor  can  track  events  as  they  occur  in  the 
processor  and  update  the  relevant  parameters. 
Thus  a  histogram  of  the  desired  parameter  is 
available  upon  program  termination. 

MemSpy[132]  is  a  tool  that  helps 
programmers  identify  memory  bottlenecks  in 
parallel  and  sequential  programs.  MemSpy 
provides  information  such  as  cache  miss  rates, 
causes of cache misses, and in multi-processor
systems, information on cache invalidations and
local  versus  remote  memory  misses. 

MTOOL[86]  is  a  tool  aimed  at  detecting 
regions  of  a  program  where  the  memory 
hierarchy  is  performing  badly.  MTOOL 
identifies  memory  bottlenecks  by  comparing 
the  measured  execution  time  with  the  predicted 
time  for  a  perfect  memory  hierarchy.  MTOOL  is 
aimed  at  FORTRAN  programs  running  on  MIPS 
based  workstations. 
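
The comparison MTOOL relies on can be sketched in a few lines of Python. This is only an illustration of the idea, not MTOOL's implementation: the region names, the timing values and the function `memory_overhead` are hypothetical.

```python
def memory_overhead(measured, ideal):
    """Return per-region time lost to the memory hierarchy, largest first.

    `measured` and `ideal` map region names to execution times in seconds;
    `ideal` is the predicted time assuming a perfect memory hierarchy.
    """
    loss = {region: measured[region] - ideal[region] for region in measured}
    return sorted(loss.items(), key=lambda item: item[1], reverse=True)

# Hypothetical timings for three program regions.
measured = {"init": 0.5, "solve": 8.0, "output": 1.0}
ideal = {"init": 0.5, "solve": 3.0, "output": 0.75}
ranking = memory_overhead(measured, ideal)
for region, lost in ranking:
    print(f"{region}: {lost:.2f} s attributed to memory stalls")
```

Regions at the top of the ranking are the candidates for memory-related tuning.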

IPS-2[138]  defines  a  computational 
hierarchy  on  the  program  being  monitored.  The 
program is represented as a black box at the
highest  level.  The  next  level  is  the  machine  level 
where  the  program  is  split  into  several  concurrent 
processes  executing  on  different  processors.  The 
third  level  represents  the  program  as  a  collection 
of  communicating  processes.  The  final  level  is 
the  primitive  activity  level.  IPS-2  uses 
instrumentation  probes  to  generate  trace  data  and 
then  evaluates  the  performance  data.  The 
instrumentation  provided  includes  a  gprof  style 
profiler  that  records  procedure  entry  and  exit 
events,  and  modified  run-time  libraries. 

The Annai[48] tool environment is
intended  for  the  development  and  performance 
evaluation  of  parallel  and  distributed 
applications.  Tool  components  include  :  (1)  A 
Parallelization  Support  Tool  (PST)  for  data- 
parallel  program  development  with  particular 
focus  on  unstructured  computations.  (2)  A 
Parallel  Debugging  Tool  (PDT)  supporting 
interactive,  source-level  debugging  and  global 
program views. (3) A Performance Monitor and
Analyzer  (PMA)  for  directed  interactive 
identification  and  tuning  of  performance 
problems.  (4)  A  Common  graphical  user 
interface  (UI)  and  tool/machine  interface  (TSA). 
We  concentrate  on  the  measurement  section  of 
the  environment,  the  PMA.  Measurement  and 
monitoring  of  the  code  is  done  by  instrumenting 
the  communication  library  and  the  compilation 
system.  The  tool  also  features  dynamic 
instrumentation  and  insertion  within  executables. 
A  run-time  execution  profile  accumulation  and 
event  trace  buffering  is  made  available  to  the 
analyst. 

The  AIMS  suite  discussed  earlier 
provides  tools  to  measure  the  execution  of 
parallel  code.  AIMS  provides  (1)  xinstrument,  a 
source code instrumentor that supports Fortran77
and  C  message-passing  programs  written  under 
two  communication  libraries  :  MPI  and  PVM.  (2) 
monitor,  a  library  of  timestamping  and  trace- 
collection routines that run on the IBM SP-2, as
well as networks of workstations (including
Convex/HP  clusters,  SparcStations  and  SGIs). 
(3)  pc,  a  utility  for  removing  the  monitoring 
overhead  and  its  effects  on  the  trace  generated. 
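
The idea of removing monitoring overhead from a trace can be sketched as follows. This is a deliberately simplified model and not the algorithm used by AIMS: it assumes a single process and a constant, known cost per recorded event, and the name `compensate` is hypothetical.

```python
def compensate(timestamps, overhead):
    """Subtract the accumulated probe overhead from each timestamp.

    Assumes a single process and a constant cost per recorded event;
    the k-th event (0-based) has been delayed by k earlier probes.
    """
    return [t - k * overhead for k, t in enumerate(timestamps)]

raw = [0, 12, 24, 40]          # microseconds, as recorded (hypothetical)
adjusted = compensate(raw, overhead=2)
print(adjusted)
```

In a message-passing program the correction is harder, because a delay on the sender can shift the receive events of other processes; the sketch ignores that effect.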

WAT  (Workload  Analyzer  Tool)[164] 
is  an  effort  by  the  University  of  Pavia  in 
collaboration  with  the  University  of  Milan.  WAT 
provides  cluster  analysis  and  other  statistical 
analysis  and  is  driven  by  a  graphical  user 
interface.  It  accepts  traces  in  a  number  of 
standard  formats  and  further  formats  can  be 
added  by  modifying  the  input  section. 

The  MEasurements  Description 
Evaluation  and  Analysis  tool  (MEDEA)[134], 
supports  the  analysis  of  trace  data.  The  various 
stages  of  trace  analysis  include  (1)  preliminary 
analysis  of  trace  data  to  correlate  the  events 
recorded  during  the  execution  of  an  application 
to  prepare  the  data  for  further  analysis.  (2) 
definition  of  a  format  which  is  a  subset  of 
performance  parameters  associated  with  the 
current  workload  component.  (3)  cluster  analysis 
to  allow  the  identification  of  classes  of  events 
with respect to certain parameters. (4) A fitting
module  allows  compact  analytic  descriptions  of  a 
workload,  which  represent  the  variation  of 
workload  parameters  with  respect  to  independent 
variables,  such  as  time.  (5)  A  functional 
description  module  allows  a  logical,  rather  than  a 
physical  description  of  the  workload.  The 
workload  is  viewed  in  terms  of  membership  of 
components  to  a  specific  cluster,  rather  than  in 
terms  of  overall  resource  utilization  such  as 
processing  time  and  (6)  data  visualization 
allowing  interactive  examination  of  the  workload 
models. 
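
The cluster analysis stage can be illustrated with a small k-means sketch in Python. The source does not say which clustering algorithm MEDEA uses, so k-means is an assumption here, as are the function name `kmeans` and the sample workload components.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute each centroid as the mean of its members; keep the
        # old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical workload components: (CPU seconds, messages sent).
events = [(1.0, 10.0), (1.2, 12.0), (0.8, 8.0), (9.0, 100.0), (11.0, 90.0)]
centroids, clusters = kmeans(events, centroids=[events[0], events[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, len(members))
```

Each resulting cluster stands for one class of workload components, which is the basis for the functional description in step (5).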

SP[140] uses hierarchical structuring of
systems into components and modules, allowing
workloads at different levels to be mapped onto
each other. A so-called complexity function
defines how much work, in terms of memory,
communication capacities and processor usage, at
one level corresponds to a unit of work at another.
SP  can  be  used  for  mapping  measurements  onto 
required  input  parameters  of  performance  models 
and  for  certain  simple  direct  modeling,  such  as 
capacity  management  decisions. 
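
A complexity function of this kind can be pictured as a table of per-operation costs. The Python sketch below is an illustration under assumed names and numbers (`map_workload`, the operations and the resource units are hypothetical), not SP's own notation.

```python
def map_workload(workload, complexity):
    """Translate operation counts at one level into resource demand at
    the level below, using a table of per-operation costs."""
    demand = {}
    for operation, count in workload.items():
        for resource, units in complexity[operation].items():
            demand[resource] = demand.get(resource, 0) + count * units
    return demand

# Hypothetical complexity function: lower-level work per operation.
complexity = {
    "query":  {"cpu_ms": 4, "disk_reads": 2, "messages": 1},
    "update": {"cpu_ms": 6, "disk_reads": 3, "messages": 2},
}
demand = map_workload({"query": 100, "update": 20}, complexity)
print(demand)
```

The resulting demand figures are the kind of input parameters a performance model at the lower level would require.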

Measuring  Operating  system  performance  : 
Modern operating systems (Solaris, NT, 95,
OS/2) include on-line performance meters that
give the user a continuous visualization
of system performance.


lmbench[229] is a suite of portable
benchmarks that compares the performance of
different UNIX systems. lmbench runs a set of
benchmark programs on the target machine in
order to obtain performance data. Benchmark
results are available for most major vendors
(SUN, HP, IBM, DEC, SGI and PCs). lmbench
is free software covered by the GNU General
Public License. lmbench provides bandwidth
benchmarks  including  cached  file  read,  memory 
read/write/copy,  and  pipes.  Latency  benchmarks 
include  context  switching,  file  system  creates  and 
deletes,  process  creation,  system  call  overhead 
and  memory  read  latency. 

SymbEL  (SE)[229]  is  an  interpreted 
language  that  acts  as  a  toolkit  for  building 
performance  tools  and  utilities.  SE  provides 
scripts  that  build  on  the  basic  tools  (vmstat, 
iostat,  sar  etc.)  to  provide  rule-based 
performance  monitors  and  viewers.  The  package 
includes  a  Motif  based  GUI  library  and  a  rules 
library. 

The  AXXiON[230]  performance 
manager  provides  performance  monitoring  for 
UNIX  and  Windows  NT  systems.  The  AXXiON 
performance  manager  can  be  configured  to 
collect  data  on  real-time  performance  elements 
including  memory  utilization,  disk  I/O, 
individual  processes  and  other  system/network 
activities.  It  delivers  snapshots  of  resource 
activity  correlating  performance  data  from  a 
variety  of  resources. 


3.3  Visualization  Tools  : 

Visualization  tools  provide  the 
developer  with  a  visual  display  of  program 
execution.  These  tools  are  useful  when  the 
behavior  of  a  program  cannot  be  inferred  easily 
by  statistical  analysis  alone.  Though  visualization 
tools  fall  into  one  of  the  three  categories 
(measurement,  simulation,  modeling),  they 
deserve  special  attention  because  of  their  unique 
graphics features. Visualization tools can be on-line
or postmortem tools. Visualization tools that
support  postmortem  analysis  do  not  instrument 
the  code  being  monitored.  They  need  a  trace  file 
as  input  to  be  processed  and  visualized. 

CHIRON[87]  is  a  visualization  system 
developed  at  the  University  of  Cape  Town  for 
displaying  performance  related  behavior  of 
shared memory multiprocessor applications. The
tool is primarily used as a performance
debugging tool which can be utilized by the
designer to fine-tune or remove performance
bottlenecks. CHIRON uses 3D graphics to
generate various performance related views
which can be scaled, rotated, translated, animated
or level-of-detail toggled. CHIRON is used to
analyze system performance (emphasis on cache
performance),  synchronization  costs,  and  data 
partitioning  in  a  parallel  program.  It  has  also 
been  used  to  optimize  sequential  programs  that 
waste  time  through  ineffective  use  of  the  memory 
hierarchy. 

ParaGraph[218]  takes  as  input  trace 
data  generated  by  the  Portable  Instrumented 
Communication  Library  (PICL),  developed  at 
Oak  Ridge  National  Labs  and  provides 
visualization  of  program  behavior.  PICL  can  also 
provide  execution  trace  data  during  an  actual  run 
of  a  parallel  program  and  the  resulting  trace  data 
can  provide  dynamic  snapshots  of  the  behavior. 
ParaGraph  organizes  the  information  into  various 
views  in  an  attempt  to  cope  with  the  massive 
amount  of  raw  information  generated.  ParaGraph 
runs  on  the  X  Window  System  and  is 
implemented  using  the  Xlib  library  for  portability 
reasons.  ParaGraph  is  designed  to  be  responsive 
to  user  interactions  while  displaying  program 
behavior  dynamically.  The  execution  behavior  of 
ParaGraph  can  be  static  (initial  selection  of 
parameter  values)  or  dynamic  (pause,  resume, 
single  step  etc.).  ParaGraph  is  extensible  with 
users  having  the  ability  to  add  new  displays  of 
their  own  design.  This  feature  supports  the  use  of 
application-specific  displays  that  can  be  used  to 
augment  the  insight  that  the  generic  views 
provide. 

PARvis[144] is a tool used on a post-mortem
basis to translate a given trace file into a
variety  of  graphical  system  views  which  provide 
a  reasonable  basis  for  system  understanding  and 
program  optimization.  PARvis  takes  as  input  an 
/pi-generated  trace  file  and  extracts  graphical 
information.  Different  views  of  the  PARvis 
system  include  single  time  system  snapshots, 
animation,  statistics  and  a  time-line  system  view. 
PARvis  is  implemented  in  C  and  uses  the  Motif 
libraries  for  its  graphic  capabilities.  Hardware 
platforms  include  IBM  RS/6000,  SUN,  DEC 
MIPS  and  Alpha  systems.  Extensions  to  PARvis 
include  display  of  network  activities  and  flow  of 
messages  on  different  topologies.  PARvis 
provides  configuration  files  that  the  user  can  edit 
from  run  to  run.  Parameters  include  color,  layout, 
fonts  etc. 

PV (Program Visualizer)[231],
developed at IBM, provides continuous visual
displays of the behavior of a program and an
underlying  system.  PV  is  designed  as  a  tool  for 
debugging  and  performance  tuning  and  analysis. 
PV  has  been  targeted  to  run  on  shared-memory 
parallel  systems  and  superscalar  uniprocessor 
workstations  (RISC  System/6000  with  AIX).  PV 
shows  hardware-level  performance  information, 
operating  system  level  activity,  communication 
library  level  activity,  language  run  time  activity 
and  application  level  activity.  Thus  PV  can  be 
used  as  a  process  level  and  system  level  monitor. 
Users  can  add  their  custom  configured  modules 
to  analyze  application  specific  characteristics. 
PV  has  been  used  to  gain  more  insight  into  the 
structure  and  dynamics  of  large  object-oriented 
applications,  frameworks  and  libraries. 

Pablo[5]  is  an  ongoing  research  project 
being  developed  at  the  University  of  Illinois. 
Pablo  is  designed  to  provide  performance  data 
capture,  analysis  and  presentation  across  scalable 
parallel  systems.  Pablo  is  best  described  as  a 
toolkit  for  the  construction  of  performance 
analysis  environments.  Pablo  consists  of  a 
portable  source  code  instrumentation  subsystem 
and  a  performance  data  analysis  subsystem  with 
a  trace  data  meta-format  coupling  the  two.  The 
performance  analysis  component  of  Pablo 
consists  of  a  set  of  data  transformation  modules 
that  can  be  interconnected  to  form  a  data  analysis 
graph.  Performance  data  flows  through  the  graph 
nodes  and  is  transformed  to  yield  the  desired 
performance metrics. Interesting features of
Pablo include immersive virtual reality to display
performance data and sonification, in which
performance data is presented as sound.
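
The data analysis graph can be pictured as a chain of small transformation functions. The Python sketch below is an illustration only: it uses a linear chain rather than a general graph, and the record fields and module names (`keep_sends`, `bytes_per_processor`) are hypothetical and unrelated to Pablo's actual trace format.

```python
def run_pipeline(records, modules):
    """Pass trace records through a linear chain of transformation modules."""
    data = records
    for module in modules:
        data = module(data)
    return data

def keep_sends(records):
    """Filter module: keep only message-send events."""
    return [r for r in records if r["event"] == "send"]

def bytes_per_processor(records):
    """Reduction module: total bytes handled by each processor."""
    totals = {}
    for r in records:
        totals[r["proc"]] = totals.get(r["proc"], 0) + r["bytes"]
    return totals

# Hypothetical trace records.
trace = [
    {"proc": 0, "event": "send", "bytes": 512},
    {"proc": 1, "event": "recv", "bytes": 512},
    {"proc": 0, "event": "send", "bytes": 256},
    {"proc": 1, "event": "send", "bytes": 128},
]
metrics = run_pipeline(trace, [keep_sends, bytes_per_processor])
print(metrics)
```

Because each module only consumes and produces data, modules can be rearranged or replaced without changing the others, which is the appeal of the toolkit approach.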

PARADE  (PARallel  program 
Animation  Development  Environment)[202]  is 
an  ongoing  project  at  the  Graphics,  Visualization 
and  Usability  center  in  Georgia  Tech  to  support 
the  design  and  implementation  of  software 
visualization  of  parallel  and  distributed 
programs.  PARADE  contains  components  for 
monitoring  a  program’s  execution,  building  the 
software  visualization  and  mapping  the  execution 
to  the  visualization.  The  primary  operation  of 
PARADE  is  post-mortem  visualization  with  trace 
files.  Software  instrumentation  is  layered  with 
decreasing  level  of  programmer  involvement. 
Instrumentation  methods  include  inclusion  of 
print  functions  at  specific  points  in  the  program, 
overriding  the  standard  communication  library 
with  macros  and  actual  modification  to  the 
library  code  to  turn  on/off  trace  flags.  PARADE 
visualizations include processor grid view, data
distribution views, communication history,
message  passing  views  etc.  Visualization  in 
PARADE  is  built  around  the  Polka  animation 
system.  Polka  provides  an  object  oriented 
interface  to  developers  which  makes  coding 
complicated  graphics  easier. 

4  MODULE  LEVEL  ANALYSIS  : 

The  tools  developed  for  analysis  at  this 
level  of  granularity  rely  heavily  on  mathematical 
methods  to  derive  time  cost  equations.  Since  the 
system  under  analysis  is  not  too  complex,  pure 
analytical  techniques  can  be  used  to  derive  time 
cost  or  other  performance  equations.  Many  of  the 
tools  and  techniques  described  in  the  section  on 
process  level  modeling  tools  can  be  utilized  here. 
This  section  surveys  tools  that  have  been 
developed  to  provide  a  pure  analytic  solution  to 
the  performance  prediction  problem. 

Metric[217]  is  an  analytic  tool  for 
estimating  the  execution  time  of  simple  LISP 
programs.  The  user  must  supply  as  input  (1)  a 
LISP  program,  (2)  a  cost  table  defining  the  time 
cost  of  basic  LISP  operations,  and  (3)  procedure 
definitions.  The  procedure  definitions  are  the 
previously  analyzed  procedures  of  the  LISP 
program  and  their  input  is  optional.  Metric  will 
not  re-evaluate  these  procedures  in  the  event  that 
they  are  supplied.  The  time  cost  of  the  program 
is  evaluated  in  three  phases.  (1)  program 
expressions  are  converted  to  cost  expressions 
based  on  the  cost  table.  (2)  Recursive  procedure 
calls  are  converted  into  a  set  of  difference 
equations  which  are  solved  in  (3)  to  produce 
closed  form  expressions.  Metric  produces  closed- 
form  expressions  characterizing  the  execution 
behavior  of  the  LISP  program  and  procedure 
definitions  to  be  used  in  future  analysis  efforts. 
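
The three phases can be illustrated on a recursive list-length procedure. The Python sketch below is not Metric itself: the cost table values are hypothetical, and the closed form was derived by hand for this one example rather than by an automatic solver.

```python
# Hypothetical cost table: time units for basic LISP operations.
COST = {"null": 1, "if": 1, "cdr": 2, "+": 1, "call": 5}

def cost_recurrence(n):
    """Difference equation for a recursive list-length procedure:
    T(0) = null + if;  T(n) = null + if + cdr + '+' + call + T(n-1)."""
    base = COST["null"] + COST["if"]
    if n == 0:
        return base
    return base + COST["cdr"] + COST["+"] + COST["call"] + cost_recurrence(n - 1)

def cost_closed_form(n):
    """Solution of the difference equation: T(n) = T(0) + n * step."""
    base = COST["null"] + COST["if"]
    step = base + COST["cdr"] + COST["+"] + COST["call"]
    return base + n * step

for n in (0, 1, 10):
    print(n, cost_recurrence(n), cost_closed_form(n))
```

The closed form can be evaluated for any list length without unrolling the recursion, which is what makes such expressions reusable in later analyses.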

The  Time  Cost  Analysis  System 
(TCAS)[188]  is  a  Computational  Structure 
Model  (CSM)-based  tool  for  analyzing  the 
execution  times  of  parallel  computations.  The 
CSM  methodology  represents  a  computation  as  a 
control  graph  and  data  graph.  The  control  graph 
shows  the  order  in  which  the  operations  are 
performed and comprises activity nodes (start,
operation,  decision  etc.)  and  edges.  An  activation 
signal  propagates  through  the  graph  representing 
an  execution  thread.  A  weight  associated  with  the 
edge  specifies  the  number  of  times  that  path 
should  be  executed.  The  data  flow  graph,  similar 
in  structure  to  the  control  flow  graph,  depicts  the 
relationship  between  the  data  and  the  operations 
of  a  computation.  The  computation  to  be 
evaluated is written in a Pascal-like language,
checked for correctness of syntax and stored in a
library  for  retrieval  for  future  analysis.  The 
computation  structure  with  the  flow  values 
(designer  specified)  can  be  solved  analytically  to 
obtain  a  time  cost  expression  which  can  then  be 
used  to  compute  different  performance  estimates 
including  minimum,  maximum  and  average 
execution  times  and  to  plot  time  cost  curves. 
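
How flow values turn into a time cost can be sketched for the simplest case. The Python fragment below assumes purely sequential execution and takes the execution count of each node as given; the node names, times and the function `time_cost` are hypothetical, and the sketch does not model TCAS's treatment of parallelism.

```python
def time_cost(nodes, visits):
    """Return (minimum, maximum) sequential time cost.

    `nodes` maps a node name to its (min, max) execution time and
    `visits` maps a node name to the number of times it is executed,
    as implied by the designer-specified flow values.
    """
    low = sum(nodes[n][0] * visits[n] for n in nodes)
    high = sum(nodes[n][1] * visits[n] for n in nodes)
    return low, high

# Hypothetical loop: the body runs 10 times, the test 11 times.
nodes = {"start": (1, 1), "body": (4, 6), "test": (1, 2)}
visits = {"start": 1, "body": 10, "test": 11}
low, high = time_cost(nodes, visits)
print(low, high)
```

Replacing the fixed visit counts with symbolic flow values would yield a time cost expression rather than a single number.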

TCAS  makes  a  number  of  assumptions 
about  the  computational  environment  at  runtime. 
The  environment  is  characterized  by  a  limited 
number  of  homogeneous  processors  that 
communicate  through  shared  memory  and 
balance  the  load  equally.  The  last  assumption  is 
strengthened  by  the  development  of  the  Optimal 
Allocation  System  (OPAS)[168]  that  provides 
four  allocation  policies  :  (1)  Equal,  (2)  Enough, 
(3)  Sequential  and  (4)  degree  of  parallelism. 
OPAS  is  constructed  and  operates  in  a  manner 
similar  to  TCAS  except  that  time  costs  can  only 
be  solved  analytically.  OPAS  determines  an 
allocation  policy  leading  to  minimal  execution 
times  and  all  performance  calculations  are  based 
on  this  policy. 

The  Data  Flow  Analysis  System 
(DFAS)[169]  estimates  the  execution  times  of 
data flow programs. DFAS provides an
interface similar to that of TCAS, but the underlying
methodology used to compute time
costs differs. DFAS is based on a token model in
which  the  computation  is  modeled  as  a  graph. 
Data  flow  is  modeled  as  tokens  that  traverse  the 
graph.  A  node  is  activated  when  the  appropriate 
number  of  tokens  become  available  on  the  node’s 
input  edges.  In  addition  to  the  computation,  the 
user  should  provide  (1)  the  time  cost  of  each 
node,  (2)  the  time  cost  of  each  edge  and  (3) 
independent  data  flows  in  the  computation. 
DFAS  computes  minimum,  maximum,  and 
average  time  costs,  time  cost  variance  and  time 
cost  distribution  of  the  computation. 
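
The token model can be sketched for an acyclic graph with fixed costs. The Python fragment below is an illustration under assumptions, not DFAS's method: every node is taken to need exactly one token on each input edge, and the graph, costs and the name `completion_time` are hypothetical.

```python
def completion_time(node_cost, edges):
    """Time at which the last node of an acyclic data flow graph finishes.

    `edges` maps (source, target) pairs to the time a token needs to
    travel along that edge. A node fires once a token has arrived on
    every one of its input edges.
    """
    inputs = {n: [] for n in node_cost}
    for (src, dst), delay in edges.items():
        inputs[dst].append((src, delay))

    finish = {}
    pending = set(node_cost)
    while pending:
        # Fire every node whose producers have all finished.
        ready = [n for n in pending if all(src in finish for src, _ in inputs[n])]
        if not ready:
            raise ValueError("graph contains a cycle")
        for n in ready:
            start = max((finish[src] + delay for src, delay in inputs[n]), default=0)
            finish[n] = start + node_cost[n]
            pending.remove(n)
    return max(finish.values())

# Hypothetical diamond-shaped computation: a feeds b and c, both feed d.
node_cost = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 2, ("c", "d"): 1}
total = completion_time(node_cost, edges)
print(total)
```

Running the same calculation with minimum and maximum node costs gives bounds on the time cost of the computation.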

5  STATISTICAL  SUPPORT  TOOLS  : 

Performance  analysis  of  computer 
systems  can  produce  an  abundance  of  raw  data 
that  has  to  be  managed  and  statistically 
processed.  This  has  become  a  serious  concern  to 
designers  of  performance  analysis  tools  for 
parallel  and  distributed  systems  due  to  the 
overwhelming  amount  of  data  generated.  Most  of 
the  tools  surveyed  earlier  have  capabilities  to 
manage  and  analyze  the  generated  data  or 
provide  users  with  the  utilities  to  do  so. 
Additional  statistical  support  tools  may  be 
needed to augment the existing statistical
capabilities of the tool. Table 1 provides a
summary  of  some  of  the  statistical  support  tools. 

6 A CLASSIFICATION METHODOLOGY :

The  sheer  number  of  tools  available  to  a 
performance  analyst  makes  selecting  an 
appropriate  tool  a  challenging  task.  This  section 
presents  a  classification  scheme  that  partitions 
the  tools  based  on  their  properties.  A  database 
can  be  designed  around  this  classification  scheme 
that  would  then  enable  performance  analysts  to 
retrieve  tool  information  based  on  a  keyword 
search. 

The  classification  scheme  is  presented 
as  a  list  of  figures.  The  scheme  is  tree  based  with 
each  node  of  the  tree  representing  a  property 
unique  to  a  set  of  tools.  The  tool  list  is  refined  as 
we  traverse  the  depth  of  the  tree.  Some  of  the 
performance  tools  surveyed  in  this  paper  can  be 
placed  in  more  than  one  category  and  thus  can  be 
placed  under  multiple  nodes  in  the  classification 
tree. 
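
A tree of this kind, and the retrieval it enables, can be sketched with nested dictionaries. The Python fragment below is an illustration only: the node names and the placement of tools under them are assumptions made for the example and do not reproduce the classification figures.

```python
# Illustrative fragment of a classification tree; the placement of tools
# under each node is an assumption made for this example.
TREE = {
    "measurement": {
        "software monitor": ["SPY", "IPS-2"],
        "hardware monitor": ["TX-2", "INCAS"],
    },
    "visualization": {
        "postmortem": ["ParaGraph", "PARvis", "PARADE"],
        "on-line": ["PV", "ParaGraph"],
    },
}

def find_tools(tree, *path):
    """Return the sorted set of tools under the node reached by `path`."""
    node = tree
    for key in path:
        node = node[key]
    found = set()
    stack = [node]
    while stack:
        item = stack.pop()
        if isinstance(item, dict):
            stack.extend(item.values())
        else:
            found.update(item)
    return sorted(found)

print(find_tools(TREE, "visualization"))
print(find_tools(TREE, "measurement", "hardware monitor"))
```

Each step down the tree narrows the tool list, and a tool listed under several nodes is reported once when their common ancestor is queried.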

7 CONCLUSION :

This  paper  has  surveyed  a  number  of 
performance  tools  that  have  been  or  are  in  the 
process  of  being  developed  in  academia  and 
industry.  Many  of  the  tools  surveyed  use 
sophisticated  techniques  to  simulate  or  measure 
the  performance  of  the  target  system  (software 
and  hardware).  The  tools  surveyed  fall  into  three 
broad  categories  namely  system  level,  process 
level  and  module  level.  The  measurement  tools 
that  were  surveyed  in  this  paper  employ 
sophisticated  techniques  to  capture  system 
information in a uni-processor/multi-processor
environment.  A  majority  of  the  modeling  tools 
surveyed  provide  the  analyst  the  option  to  solve 
the  system  model  analytically,  in  addition  to 
detailed  simulation  capabilities.  Thus  the 
performance  analyst  can  use  the  analytical 
solution to obtain coarse estimates of system
performance in addition to simulating the model
to obtain more accurate estimates at the cost of
more computational power. The paper also
discussed a classification scheme to help
performance analysts obtain information about
the various tools. The information can aid the
designer  in  making  a  decision  as  to  the  type  of 
tool  to  be  used  to  estimate/evaluate  the 
performance  of  the  underlying  system. 


TABLE 1

Summary of Statistical Tools

A: basic statistics (mean, median, standard deviation, etc.); B: analysis of variance; C: multivariate
analysis; D: regression analysis; E: cluster analysis; F: time-series analysis; G: correlation; H: non-
parametric statistics; J: random number generation.

| Tool | Description | Platform | Capabilities | Ref No. |
|---|---|---|---|---|
| BASS | BASS provides a limited collection of statistical routines. | IBM-PC | A, B, D, F, H | 234 |
| BLSS | BLSS is an interactive statistics package supporting matrix operations. | UNIX workstations | A, B, C, D, J | 233 |
| BMDP | BMDP provides a comprehensive collection of statistical routines in addition to a database management facility, a full screen editor and graphic facilities. | Various | A, B, C, D, E, F, G, H, I | 233, 234 |
| CLAM | CLAM is an interactive environment for matrix-based computations, eigenvalues, eigenvectors, fast Fourier transforms, etc. | Various | A, B, C | 233 |
| CLASP | CLASP is a tool tailored for cluster and multivariate analysis. | SUN | C, E | 233 |
| CSS | CSS is a menu-driven facility providing an extensive collection of statistical routines in addition to a database management facility and a spreadsheet-like editor. | IBM-PC | A, B, C, D, E, F, G, H | 234 |
| GLIM | GLIM is an interactive statistics package that provides a facility for interfacing with user-supplied FORTRAN subroutines. | SUN | B, D, G | 233 |
| IMSL Libraries | The IMSL library is a collection of over 800 FORTRAN subroutines to support statistical analysis and other areas in applied mathematics such as eigensystem analysis, linear systems, differential equations, matrix/vector operations etc. | Various | B, D, G | 233, 235 |
| MathStation | MathStation is a general-purpose, interactive tool that supports statistical analysis. MathStation can interface to FORTRAN subroutines and libraries. | SUN | A, B, D, G | 233 |
| MATLAB | MATLAB, though oriented toward matrix-based computations, provides a limited statistics capability. MATLAB supports matrix operations, eigenvalues, eigenvectors, fast Fourier transforms, spectral analysis, convolution etc. | Various | C, D | 233, 234 |
| Minitab Statistical Software | Minitab provides a limited collection of statistical routines. | SUN | A, B, D, F, H | 233 |
| Maximum Likelihood Program | MLP is a tool for fitting probability distributions to observed data. | SUN | D | 233 |
| NAG FORTRAN Library | The Library contains over 700 routines for statistics as well as other mathematical areas such as linear algebra, differential equations, fast Fourier transforms, interpolation etc. | Various | A, B, D, F, G, H, J | 233 |
| NCSS | NCSS provides a collection of statistical routines and an advanced graphics utility. | IBM-PC | A, B, C, D, E, G, H, I | 234 |
| Prodas | Prodas provides a collection of statistical routines in addition to database and graphics utilities. Prodas can be run either interactively or in batch mode. | IBM-PC | A, B, C, D, E, G, H, I | 234 |
| P-Stat | P-Stat provides an extensive collection of statistical routines, a data management facility, a report writer, and a command-generator utility. | Various | A, B, C, D, E, F, G | 233, 234 |
| RS/1 | RS/1 provides a limited collection of statistical routines and supports a graphics capability. | Various | A, B, D, H | 234 |
| Sandie | Sandie, originally developed for use in an educational environment, provides a limited collection of statistical routines and supports a multi-window user interface. | IBM-PC | A, B, D, G, J | 233, 234 |
| SAS | SAS provides a comprehensive set of statistical routines and advanced graphic capabilities. | Various | A, B, C, D, E, F, G, H, I | 234 |
| Sigstat | Sigstat provides a comprehensive set of statistical and graphical routines. | IBM-PC | A, B, C, D, E, F, G, H, I | 234 |
| SORITEC | SORITEC provides a limited statistical capability and supports mathematical functions including matrix algebra and analytical differentiation. SORITEC can be executed interactively or in batch mode. | SUN | D, G, H | 233 |
| Speakeasy | Speakeasy supports statistical correlation and regression analysis and provides other mathematical functions for solving matrix algebra, set algebra, linear algebra, and differential equations. | SUN | D, G | 233 |
| S-PLUS | The S-PLUS package provides a variety of statistical routines and graphics facilities. | SUN | B, C, D, F, G, H | 233 |
| Unifit | Software tool for fitting probability distributions to observed data. | Various | A | 235 |


FIGURE 2 (SIMULATION TOOLS), continued in FIGURES 2b, 2c and 2d (SIMULATION TOOLS Contd.)

[Classification trees for discrete event simulation tools; diagrams not reproduced. The trees branch on
features (simulation languages, tool kits/functions, augmented programming languages), paradigm (object
oriented or non object oriented), programming interface (flow chart oriented or statement oriented),
methodology (event, process or activity oriented), user interface and platform. The numbers at the tree
nodes refer to the Simulation Tools List below. The key marks the symbol that represents a Decision Box.]

Simulation  Tools  List : 

(1) SMPL [125], (2) YACSIM [105], (3) SimPack [75], (4) SIMULA [33], (5) CSIM [184], (6) SIM++ [224],
(7) SIMAN [160], (8) SIMSCRIPT [191], (9) MODSIM [27], (10) SLAM II [153], (11) HOCUS [163], (12) DEMOS [163],
(13) FAST [179], (14) GPSS [37], (15) INSIGHT [173], (16) SimCal [129], (17) SIMTOOLS [177], (18) Smalltalk [113].

Measurement Tools List :

(1) TMP [220], (2) Vox 8800 [47], (3) Zahlmonitor 4 [61], (4) DLA [204], (5) RKMS [43], (6) TRAMS
[139], (7) ATUM [193], (8) Berkeley UNIX Monitor [137], (9) gprof [89], (10) Monit [111], (11) Radar
[119], (12) PCA [62], (13) Fortran Analyzer [123], (14) Parasight [18], (15) PSE [228], (16) PIE [120],
(17) JADE [220], (18) INCAS [220], (19) SPARCworks [229], (20) JEWEL [117], (21) SPY [214], (22) TX-2
[148], (23) MemSpy [132], (24) MTOOL [86], (25) IPS-2 [138], (26) Annai [48], (27) AIMS [226],
(28) WAT [164], (29) MEDEA [134], (30) SP [140], (31) lmbench [229], (32) SymbEL [229],
(33) AXXiON [230].
(1) QNAP [212], (2) MODLINE [164], (3) MACOM [I Ml, (4) CAPRES [108], (5) GIST [192], (6) PAW [133], (7) RESQ [127], (8) GreatSPN [46],
(9) DSP [122], (10) QPN pS], (11) ADAS [4], (12) Mod^ [35], (13) SARA [66], (14) SiinP*r [94], (15) PARET [150], (16) TIPP [88]